Introduction

Here we test and extend in various ways two hypotheses due to Angela Brown and Delwin Lindsey (Brown & Lindsey, 2004; Lindsey & Brown, 2002), linking, across populations and languages, the color lexicon to the incidence of UVR (ultraviolet radiation of types A and B, 280–400 nm; https://en.wikipedia.org/wiki/Ultraviolet) and to the frequency of abnormal color perception.

These hypotheses can be represented graphically as shown in the figure below:

Hypothesis

Data

Input files

We use the following input files:

| Input file | Format | Description |
|---|---|---|
| Database_final4.csv | CSV file (commas + quotes) | database with all variables’ information except distance matrices |
| distmat_gen_add.csv | CSV file (commas + quotes) | distance matrix obtained with Alfred distance, where missing values have been replaced with an additive method |
| distmat_gen_ult.csv | CSV file (commas + quotes) | distance matrix obtained with Alfred distance, where missing values have been replaced with an ultrametric method |
| distmat_gen_na.csv | CSV file (commas + quotes) | distance matrix obtained with Alfred distance, with missing values |
| qlc_data_dist.csv | CSV file (commas + quotes) | distance matrix obtained with phylogenetic distance on Glottolog |
| qlc2glottocode.csv | CSV file (commas + quotes) | table converting the output names from Glottolog to glottocodes (used with qlc_data_dist.csv) |

Variables

In order to investigate our hypotheses, we analyzed a set of 142 populations dispersed across all continents. This dataset was assembled from three sources:

  • first, the data accompanying Brown & Lindsey (2004), concerning 118 populations and including information on the frequency of color blindness, color vocabulary, and UVR incidence;
  • second, Emma Meeussen’s (2015) expanded dataset, which added many populations to the Brown & Lindsey (2004) dataset and changed some parameters;
  • third, several ecological, cultural and genetic variables that we added in order to control for possible confounding.

Variables previously defined and collected

Each population is identified by the glottocode (Hammarström, Bank, Forkel, & Haspelmath, 2018) of the (primary) language it speaks, as well as by its latitude, longitude and macroarea (also from Glottolog).

Longitude

For the statistical analyses, longitude was transformed as described in Data transformation.

Min. 1st Qu. Median Mean 3rd Qu. Max.
-16.39 24.91 50 88.86 118.2 312.4
***Figure 1.*** _Histogram of longitude._

***Figure 2.*** _Map of populations in our sample highlighting their longitude (color)._

Latitude

For the statistical analyses, latitude was transformed as described in Data transformation.

Min. 1st Qu. Median Mean 3rd Qu. Max.
-38.29 4.739 23.34 22.97 42.87 69.38
***Figure 3.*** _Histogram of latitude._

***Figure 4.*** _Map of populations in our sample highlighting their latitude (color)._

Macroarea

Africa Australia Eurasia North America Papunesia South America
31 2 79 9 9 12
***Figure 5.*** _Map of populations in our sample highlighting their macroarea (color)._

Number of color categories

In their previous investigations, Brown & Lindsey (2004) and Meeussen (2015) gathered information about color vocabularies according to the definition of Basic Color Categories from Berlin & Kay (1991). For this research, we selected only two variables from these linguistic data: the size of the color lexicon, and the presence of a specific term for blue color. The number of color categories (variable Nb_color_cat) varies between 2 and 12, whereas variable Blue can only have the values yes or no.

Min. 1st Qu. Median Mean 3rd Qu. Max.
2 5 7 7.662 10.75 12
***Figure 6.*** _Number of color categories._

***Figure 7.*** _Map of color categories (color scale)._

Specific term for Blue

no yes
60 82
***Figure 8.*** _Words for blue._

***Figure 9.*** _Map of words for blue (color scale)._

Daltonism

Data on abnormal red-green color perception for males was collected from 85 references, using the Ishihara test (Ishihara, 1917), the anomaloscope (Patent No. US3382025A, 1968), the Holmgren-Thomson wool test (Thomson, 1880) or the Hardy-Rand-Rittler pseudoisochromatic plate test (Hardy, Rand, & Rittler, 1954).

These data represent the percentage of red-green “color blind” males in the population, and vary between 0% and 11%. Overall “color blindness” rates were used but, when more specific information was available, the rate measures only deuteranopia, deuteranomaly, protanopia and protanomaly. Data on tritanopia was therefore not included in this variable, since it implies abnormal color perception in the yellow-blue range. Samples where no subdivision between males and females was made, or containing data for fewer than 50 people, were excluded from this analysis.

Min. 1st Qu. Median Mean 3rd Qu. Max.
0 1.87 3.268 3.88 5.652 10.68
***Figure 10.*** _Incidence of daltonism._

***Figure 11.*** _Map of incidence of daltonism (color scale)._

UVR

The UVR incidence was calculated from the data available from the NASA Total Ozone Mapping Spectrometer (TOMS) at http://toms.gsfc.nasa.gov/ery_uv/new_uv/ for the whole year 1998. This variable measures the UV radiation received by the human body, in joules per square meter (J/m2), and takes into account the thickness of the ozone layer in the stratosphere, the amount of cloud cover, the elevation, and how high the sun is in the sky. Please note that, for the statistical analyses, this variable was z-scored.

Min. 1st Qu. Median Mean 3rd Qu. Max.
5.683 15.45 25.12 21.71 27.9 30.58
***Figure 12.*** _Distribution of UVR._

***Figure 13.*** _Map of incidence of UVR (color scale)._

Additional variables

Climate and weather

We also included several additional variables that may impact the amount of UVR received by the eye and its effects, the color lexicon, or the color blindness rate.
Building on Bentz, Dediu, Verkerk, & Jäger (2018), historical data on global weather and climate were obtained from WorldClim, and 19 variables were extracted for the period 1960-1990, covering measures such as temperature, seasonality and precipitation (see Bentz et al., 2018). A Principal Component Analysis (PCA) found that the first two principal components (PCs) explain most of the variance and have meaningful interpretations: PC1 explains 48.6% of the variance and reflects a wet, hot climate with low seasonality, whereas PC2 explains 24.1% of the variance and reflects a hot, dry climate with high seasonality (see Bentz et al., 2018 for details).
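The dimension-reduction step can be sketched as follows (a minimal NumPy sketch of PCA on standardized variables, not the original R code; the function name is ours):

```python
import numpy as np

def pca_explained(X):
    """PCA via SVD on standardized columns; returns component scores and the
    proportion of variance explained by each principal component."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each variable
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                             # principal component scores
    explained = s ** 2 / (s ** 2).sum()        # variance proportions per PC
    return scores, explained
```

In the climate case, the first two entries of `explained` would correspond to the 48.6% and 24.1% reported by Bentz et al. (2018).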

Climate PC1
Min. 1st Qu. Median Mean 3rd Qu. Max.
-10.06 -4.672 -2.032 -2.071 0.08 5.195
***Figure 14.*** _Distribution of climate PC1._

***Figure 15.*** _Map of climate PC1 (color scale)._

Climate PC2
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.604 -2.464 -0.252 -0.3692 1.572 5.03
***Figure 16.*** _Distribution of climate PC2._

***Figure 17.*** _Map of climate PC2 (color scale)._

Altitude

Altitude data was obtained from Bentz et al. (2018); please note that for the analyses, we use log(altitude), so populations at an altitude of 0 produce the -Inf values in the summary below.

Min. 1st Qu. Median Mean 3rd Qu. Max.
-Inf 4.636 5.725 -Inf 6.732 8.337
***Figure 18.*** _Distribution of log(altitude)._

***Figure 19.*** _Map of log(altitude) (color scale)._

Distance to bodies of water

Distances to lakes, oceans, rivers, and water in general were also obtained from Bentz et al. (2018); they were computed using an OpenStreetMap raster file, with distance to water being the minimum distance to any sort of body of water.

Please note that for the analyses, these distances are log-transformed.

Distance to lakes
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.451 1.636 2.679 2.618 3.362 5.687
***Figure 20.*** _Distribution of log distance to lakes._

***Figure 21.*** _Map of log distance to lakes (color scale)._

Distance to rivers
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2546 2.543 3.443 3.39 4.078 7.504
***Figure 22.*** _Distribution of log distance to rivers._

***Figure 23.*** _Map of log distance to rivers (color scale)._

Distance to oceans
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.223 4.007 5.133 4.968 6.252 7.104
***Figure 24.*** _Distribution of log distance to oceans._

***Figure 25.*** _Map of log distance to oceans (color scale)._

Distance to water
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.451 1.545 2.252 2.361 3.181 5.474
***Figure 26.*** _Distribution of log distance to water._

***Figure 27.*** _Map of log distance to water (color scale)._

Subsistence strategy

Subsistence strategy was obtained from AUTOTYP (Bickel et al., 2017) as coded in (Blasi et al., 2019), and represented by the binary variable hg: “yes” represents a population whose subsistence mode is based on hunting, fishing, gathering and/or foraging, whereas a “no” means that the subsistence mode of the population is centered around food production. As these data did not cover all the populations in our dataset, we expanded it by looking at other databases, such as D-Place (Kirby et al., 2016) and Seshat (Turchin et al., 2015).

no yes
128 14
***Figure 28.*** _Subsistence strategy._

***Figure 29.*** _Map of subsistence strategy (hunting-gathering)._

Population size

Population size was obtained from Bentz et al. (2018); please note that for the analyses, we use its logarithm.

Min. 1st Qu. Median Mean 3rd Qu. Max.
0 11.01 14.46 13.78 16.73 24.25
***Figure 30.*** _Log population size._

***Figure 31.*** _Map of log population size._

Language family

Language family was obtained from Bentz et al. (2018); there are 32 unique language families, with most languages belonging to the Indo-European, Afro-Asiatic and Uralic families. The putative geographic origins of language families (latitude and longitude) were obtained from Wichmann, Müller, & Velupillai (2010), supplemented with information from Glottolog (Hammarström et al., 2018).

In the map, only the 15 most represented language families are shown individually; the other families are gathered under “Other”.

abkh1242 afro1255 ainu1252 araw1281 atha1245 atla1278 aust1305
1 13 1 1 2 19 3
aust1307 ayma1253 basq1248 chib1249 drav1251 eski1264 hadz1240
9 1 1 1 2 3 1
indo1319 japo1237 jiva1245 kore1284 maya1287 nilo1247 nucl1710
41 1 1 1 4 3 3
otom1299 pama1250 pano1259 sino1245 taik1256 ticu1244 tupi1275
1 1 1 9 1 1 1
turk1311 ural1272 utoa1244 yano1268
4 9 1 1
***Figure 32.*** _Language families. Only the 15 most represented language families are explicitly shown, the others being gathered in the umbrella category 'Other'._

***Figure 33.*** _Map of main language families. Only the 15 most represented language families are explicitly shown, the others being gathered in the umbrella category 'Other'._

Putative origins of language families
***Figure 34.*** _Putative geographic origins of language families._

Genetic distances

Unfortunately, there is apparently a lack of information concerning cross-population variation in the opsin genes (OPN1MW and OPN1LW). However, using a set of 96 micro-haplotypes and 382 SNP variants available from the ALFRED database (Rajeevan, 2003), we computed the overall genetic distance between populations with Arlequin (Excoffier & Lischer, 2010). Missing values in the final distance matrix were estimated using ultrametric (Lapointe & Kirsch, 1995) and additive (De Soete, 1984) data imputation methods (see Appendix).

Phylogenetic data were obtained from Glottolog (Hammarström et al., 2018), and a distance matrix between our focus populations was created using qlcData and cophenetic in R (R Core Team, 2020). For more details, please see the section about Point Pattern Analysis.
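The tree-to-distance-matrix step can be sketched in Python (a SciPy analogue of the R workflow above; the toy condensed-distance input below is hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def cophenetic_matrix(condensed):
    """Cluster a condensed distance vector and return the cophenetic
    (tree-based) pairwise distance matrix between the leaves."""
    condensed = np.asarray(condensed, dtype=float)
    Z = linkage(condensed, method='average')   # build a tree
    return squareform(cophenet(Z))             # tree heights as distances
```

Two leaves end up at the cophenetic distance given by the height at which their branches join, which is exactly the structure visible in the phylogenetic matrix shown later (blocks of identical values per clade).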

Finally, we applied classic multi-dimensional scaling (cmdscale) to the previously computed distance matrix, and selected the minimum number of principal coordinates resulting in a goodness-of-fit of at least 50%.
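This selection step can be sketched as follows (a Python sketch of classical MDS rather than R's cmdscale; we use the proportion of positive eigenvalues retained, which matches one of cmdscale's goodness-of-fit criteria):

```python
import numpy as np

def cmdscale_gof(D, min_gof=0.5):
    """Classical (Torgerson) multidimensional scaling of distance matrix D;
    returns coordinates for the smallest number of principal coordinates
    whose cumulative goodness-of-fit reaches min_gof."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    pos = np.clip(eigvals, 0.0, None)
    gof = np.cumsum(pos) / pos.sum()           # cumulative fit per dimension
    k = int(np.searchsorted(gof, min_gof)) + 1
    coords = eigvecs[:, :k] * np.sqrt(pos[:k])
    return coords, k
```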

Data transformation

For some statistical analyses (including regression models), we performed the following data transformations:

  • logarithm base 2 (log) was applied to distance to water, oceans, rivers, lakes, altitude, and population size, as their range on the original measurement scales were very wide with a few extreme outliers;
  • the latitude of the populations and language family origins were converted to a number ranging from 0.0 to 1.0 using cos(X*(pi/180)): these new values capture the closeness to the equator, with 1.0 representing a location on the equator, and 0.0 representing a location at one of the poles (the actual variation of our data was between 0.4 and 1.0);
  • the longitude of the populations and language family origins, which originally varied between 0 and 360 (centered on the Pacific), were converted using the same function as above, and ranged between -1.0 and 1.0;
  • UVR incidence was z-scored.
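These transformations can be sketched as follows (NumPy; the function names are ours):

```python
import numpy as np

def rescale_lat(deg):
    """Closeness to the equator via cos(X*(pi/180)): 1.0 on the equator,
    0.0 at a pole; applied to longitudes on 0-360 it ranges -1.0 to 1.0."""
    return np.cos(np.asarray(deg, dtype=float) * np.pi / 180.0)

def zscore(x):
    """Standardize to mean 0 and standard deviation 1 (applied to UVR)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def log2_transform(x):
    """Log base 2 for wide-ranged positive variables (distances, altitude,
    population size)."""
    return np.log2(np.asarray(x, dtype=float))
```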

Summary of the variables

Thus, a total of 19 variables (excluding the genetic multi-dimensional scaling dimensions) have been used for the analysis:

  • Blue: is there a specific term for the color blue in the language? (yes/no)
  • erythemal UVR: z-score of the incidence
  • number of color categories
  • daltonism: proportion of males with abnormal red-green color perception
  • subsistence strategy: hunting-and-gathering? (yes/no)
  • population size: log base 2
  • altitude: log base 2
  • climate Principal Component 1
  • climate Principal Component 2
  • distance to water: log base 2
  • distance to lakes: log base 2
  • distance to rivers: log base 2
  • distance to oceans: log base 2
  • latitude: rescaled
  • longitude: rescaled
  • latitude of the language family origin: rescaled
  • longitude of the language family origin: rescaled
  • language family
  • macroarea
  • dimensions from multi-dimensional scaling of genetic distances

Five distance matrices were also collected among the populations (see Appendix and Genetic distance from Alfred (FST) for more info):

  • one distance matrix gathering geographical distances “as the crow flies” between populations
  • one phylogenetic distance matrix computed from Glottolog information
  • one genetic distance matrix computed from ALFRED (see Appendix) without data imputation
  • one genetic distance matrix computed with ALFRED (see Appendix) with an additive data imputation
  • one genetic distance matrix computed with ALFRED (see Appendix) with an ultrametric data imputation

Point Pattern Analysis

We performed a Point Pattern Analysis (PPA) of our datapoints. PPA is a set of methods used for the detection of patterns and spatial arrangements and structures in a set of map locations (Spielman, 2017). Our dataset can be considered as a relatively random sample of populations, as it was driven by data availability for the color vocabulary and daltonism rate.

Our datapoints represent populations, and their locations are defined by longitude and latitude on the Earth’s surface. These points have marks attached to them, and we focus here on two such marks: a categorical mark (Blue) and a continuous mark (daltonism). Covariates, such as elevation or climate, are treated as explanatory rather than as part of the a priori hypotheses, and they have been used in previous studies of linguistic diversity (e.g., Bentz et al., 2018; Everett, 2013). Here, we used statistical methods traditionally employed in PPA, such as comparison with a Poisson process, modelling and spatial autocorrelation.

We also explored neighbouring coefficients and various distance matrices. The neighbouring coefficient describes the degree of similarity that populations share with their neighbours for different attributes, estimated with correlograms using Pearson correlations between the focus populations and their neighbours. Here, the neighbours were defined using various distance matrices, and we averaged the correlation coefficients across these distance matrices.
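The neighbouring coefficient just described can be sketched as follows (a Python sketch under our reading of the text; `k` and the function names are illustrative, and the original analysis estimated correlograms rather than a single fixed-k coefficient):

```python
import numpy as np

def neighbour_correlation(values, dist, k=5):
    """Pearson correlation between each population's value and the mean value
    of its k nearest neighbours under a given distance matrix."""
    values = np.asarray(values, dtype=float)
    d = np.asarray(dist, dtype=float).copy()
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per row
    neigh_mean = values[idx].mean(axis=1)
    return np.corrcoef(values, neigh_mean)[0, 1]

def averaged_coefficient(values, dist_mats, k=5):
    """Average the coefficient across several distance matrices."""
    return np.mean([neighbour_correlation(values, d, k) for d in dist_mats])
```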

Here we list various resources that we used for these analyses:

Data

Point patterns

We used two different data formats, class ppp and class sp:

  • three objects of class ppp, representing point pattern datasets in the two-dimensional plane: one without marks, one with Blue as a mark, and one with daltonism as a mark;
  • one object of class sp (a SpatialPointsDataFrame) carrying both marks.

sp with marks

class       : SpatialPointsDataFrame 
features    : 142 
extent      : -176.207, 178.33, -38.2881, 69.3761  (xmin, xmax, ymin, ymax)
crs         : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
variables   : 2
names       : dalto, Blue 
min values  :     0,   no 
max values  : 10.68,  yes 

ppp with no marks

Planar point pattern: 142 points
window: rectangle = [-16.3916, 312.4226] x [-38.2881, 69.3761] units

ppp with Blue as mark

Marked planar point pattern: 142 points
Multitype, with levels = no, yes 
window: rectangle = [-16.3916, 312.4226] x [-38.2881, 69.3761] units

ppp with daltonism as mark

Marked planar point pattern: 142 points
marks are numeric, of storage type  'double'
window: rectangle = [-16.3916, 312.4226] x [-38.2881, 69.3761] units

Densities

***Figure 35.*** _Density surfaces for languages without a specific term for Blue._

***Figure 36.*** _Density surfaces for languages with a specific term for Blue._

We can see in the plots above that the densities of Blue=“no” and Blue=“yes” are different: populations having a specific term for Blue are mostly clustered in Europe, whereas populations lacking a word for Blue are mostly found in Africa, South America, the Middle East and South Asia.

***Figure 37.*** _Pairwise density surfaces for Blue 'yes' and 'no'._

This graph combines two density surfaces: the density surface for the “yes” condition and the density surface for the “no” condition. In other words, we computed the smoothed kernel intensity estimate for each condition (yes and no) and displayed them against each other.
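The per-condition smoothing step can be sketched with a Gaussian kernel density estimate (a Python analogue of the intensity estimation, not the original spatstat code):

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_surface(lon, lat, grid_lon, grid_lat):
    """Smoothed kernel intensity estimate for one condition's points,
    evaluated on a longitude/latitude grid."""
    kde = gaussian_kde(np.vstack([lon, lat]))
    gx, gy = np.meshgrid(grid_lon, grid_lat)
    return kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
```

Fitting this once for Blue=“yes” and once for Blue=“no” yields the two surfaces compared in the figure.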

Distance matrices

We used 5 different distance matrices and 1 neighbouring matrix, constructed using different packages:

  • GeoDistanceInMetresMatrix, which gives pairwise distances between populations in kilometers (km), as the crow flies, accounting for the natural curvature of the Earth (code found on [EurekaStatistics](https://eurekastatistics.com/calculating-a-distance-matrix-for-geographic-points-using-r/))
  • Delaunay distance with natural borders: neighbouring matrix
  • Phylogenetic distance, computed using family classifications from Glottolog
  • Genetic distance computed using allele frequency data from ALFRED, using 3 different data imputation methods:
    • additive data imputation
    • ultrametric data imputation
    • no data imputation

GeoDistanceInMetresMatrix

Visualize only first 6 rows and columns:

  abua1244 acol1236 ainu1240 alba1267 alge1239 amha1245
abua1244 0 2879 13060 4223 3405 3703
acol1236 2879 0 11321 4330 4639 1187
ainu1240 13060 11321 0 9010 10329 10137
alba1267 4223 4330 9010 0 1591 3767
alge1239 3405 4639 10329 1591 0 4500
amha1245 3703 1187 10137 3767 4500 0
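The great-circle computation underlying such a matrix can be sketched with the standard haversine formula (the linked EurekaStatistics code uses a slightly different spherical formulation; this Python version is illustrative):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle ('as the crow flies') distance in km between two
    (latitude, longitude) points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_matrix_km(coords):
    """Symmetric pairwise distance matrix for a list of (lat, lon) pairs."""
    return [[haversine_km(a[0], a[1], b[0], b[1]) for b in coords] for a in coords]
```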

Delaunay distance

This is based on a method developed in Cysouw, Dediu, & Moran (2012) and adapted by Marc Tang: it computes neighbouring datapoints by taking into account the natural barriers to population movement represented by seas and oceans (see figures below).

nerror = 4 
Increasing madj from 176 to 212 and trying again.
nerror = 4 
Increasing madj from 212 to 255 and trying again.

Let’s visualize the neighbouring matrix for the 6 first rows:

id glottocode long lat 1 2 3 4 5 6 7 8 9 10 11
1 abua1244 6.615 4.831 49 99 139 NA NA NA NA NA NA NA NA
2 acol1236 32.51 3.577 13 55 63 73 74 96 NA NA NA NA NA
3 ainu1240 142.5 43.63 24 56 68 91 141 NA NA NA NA NA NA
4 alba1267 20 41 21 44 51 54 110 NA NA NA NA NA NA
5 alge1239 3.23 35.42 49 76 87 122 NA NA NA NA NA NA NA
6 amha1245 39.54 11.71 13 63 119 NA NA NA NA NA NA NA NA

Genetic distance from Alfred (FST)

Genetic distances were estimated using FST on data from the ALFRED database (see method in Appendix). Because this genetic distance matrix has many missing values, two imputation methods were used (in addition to using the distance matrix as it is): additive and ultrametic.

Let’s visualize the additive matrix, for the 6 first rows and columns:

  kaba1278 krio1252 mesc1238 alge1239 nyun1247 gusi1247
kaba1278 0 0.6572 0.4678 0.2358 0.2357 0.513
krio1252 0.6572 0 -0.3712 -0.7722 -0.9272 -0.3804
mesc1238 0.4678 -0.3712 0 -0.8432 -1.005 0.5988
alge1239 0.2358 -0.7722 -0.8432 0 0.2555 -0.8523
nyun1247 0.2357 -0.9272 -1.005 0.2555 0 -1.015
gusi1247 0.513 -0.3804 0.5988 -0.8523 -1.015 0

Let’s visualize the ultrametric matrix, for the 6 first rows and columns:

  kaba1278 krio1252 mesc1238 alge1239 nyun1247 gusi1247
kaba1278 0 0.6572 0.4678 0.2358 0.2357 0.513
krio1252 0.6572 0 0.05766 0.03598 0.04613 0.00367
mesc1238 0.4678 0.05766 0 0.05766 0.05766 0.5988
alge1239 0.2358 0.03598 0.05766 0 0.2555 0.03598
nyun1247 0.2357 0.04613 0.05766 0.2555 0 0.04613
gusi1247 0.513 0.00367 0.5988 0.03598 0.04613 0

Let’s visualize the no data imputation matrix, for the 6 first rows and columns:

  kaba1278 krio1252 mesc1238 alge1239 nyun1247 gusi1247
kaba1278 0 0.6572 0.4678 0.2358 0.2357 0.513
krio1252 0.6572 0 NA NA NA NA
mesc1238 0.4678 NA 0 NA NA 0.5988
alge1239 0.2358 NA NA 0 0.2555 NA
nyun1247 0.2357 NA NA 0.2555 0 NA
gusi1247 0.513 NA 0.5988 NA NA 0

Phylogenetic distance from Glottolog

The table qlc_data_dist.csv was computed using qlcData and the function cophenetic.

Let’s visualize the phylogenetic matrix, for the 6 first rows and columns:

  pare1266 carn1240 viet1252 fiji1243 wall1257 samo1305
pare1266 0 150.6 150.6 188.2 188.2 188.2
carn1240 150.6 0 150.6 188.2 188.2 188.2
viet1252 150.6 150.6 0 188.2 188.2 188.2
fiji1243 188.2 188.2 188.2 0 94.12 94.12
wall1257 188.2 188.2 188.2 94.12 0 70.59
samo1305 188.2 188.2 188.2 94.12 70.59 0

Notes on methods

For brms, to select the best model, we used Bayes factors, WAIC, LOO and KFOLD. Please note that, for brms, there might be differences between Bayes factors, on the one hand, and WAIC/LOO/KFOLD, on the other, due to the default use of improper priors (see, for example, https://stats.stackexchange.com/questions/407964/bayes-factors-and-predictive-accuracy-in-model-comparison-in-rstan-brms); therefore, we use both approaches for model selection.

Results

Point Pattern Analysis

Spatial randomness

Is our data randomly distributed in space?

Under the assumption of CSR (complete spatial randomness), points are independent of each other and have the same likelihood of being found at any location. These assumptions are almost never met: instead, the location of points is driven by point attributes (marks), other covariates, and random noise.

Poisson is a discrete probability distribution; let’s visualize it:

We can think of our null hypothesis for the spatial distribution of our points as one of complete spatial randomness (CSR). By comparing our data to this null of randomness, we can determine whether our point process is significantly departing from spatial randomness by being either clustered or regularly spaced.
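This CSR null can be simulated directly: draw a Poisson number of points and place them uniformly and independently in the window (a Python sketch; spatstat's rpoispp does the equivalent in R):

```python
import numpy as np

def simulate_csr(lam, xmin, xmax, ymin, ymax, rng=None):
    """Homogeneous Poisson process with intensity lam: a Poisson number of
    points, each placed uniformly in the rectangular window."""
    rng = np.random.default_rng(rng)
    area = (xmax - xmin) * (ymax - ymin)
    n = rng.poisson(lam * area)        # Poisson-distributed point count
    x = rng.uniform(xmin, xmax, n)
    y = rng.uniform(ymin, ymax, n)
    return x, y
```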

χ2 test on quadrat counts

The classic method is to compare our data to a CSR process using a χ2 test based on quadrat counts, or a Monte-Carlo test.

Chi-squared test of CSR using quadrat counts: mydatappp

| Test statistic | df | P value | Alternative hypothesis |
|---|---|---|---|
| 401 | 74 | 4.202e-46 *** | two.sided |

Conditional Monte Carlo test of CSR using quadrat counts using Pearson X2 statistic

| Test statistic | P value | Alternative hypothesis |
|---|---|---|
| 401 | 0.001 *** | two.sided |

We can see that both produce very low p-values, so we can reject the null hypothesis of CSR.
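The quadrat-count test can be sketched as follows (a Python/SciPy sketch mirroring the logic of spatstat's quadrat.test; the window and grid size are illustrative):

```python
import numpy as np
from scipy.stats import chisquare

def quadrat_test(x, y, window, nx=5, ny=5):
    """Chi-squared test of CSR: divide the window into nx*ny quadrats and
    compare observed counts to the uniform expectation under CSR."""
    xmin, xmax, ymin, ymax = window
    counts, _, _ = np.histogram2d(x, y, bins=[nx, ny],
                                  range=[[xmin, xmax], [ymin, ymax]])
    observed = counts.ravel()
    expected = np.full(observed.size, observed.sum() / observed.size)
    return chisquare(observed, f_exp=expected)   # (statistic, p-value)
```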

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test compares our empirical (sampled) data to a theoretical (in our case, CSR) ideal, to see whether the two are significantly different. With point data, we specify a real function \(T(x,y)\) defined at all locations \(x,y\) in our sampling window, evaluate it at each of the data points, and compare the empirical values of \(T\) with the predicted distribution of \(T\) under the CSR assumptions.
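For the one-coordinate case, this amounts to testing the x coordinates against the uniform distribution implied by CSR in a rectangular window (Python sketch; the function name is ours):

```python
from scipy.stats import kstest

def ks_csr_x(x, xmin, xmax):
    """Kolmogorov-Smirnov test of the x coordinates against the uniform
    distribution implied by CSR in a rectangular window."""
    return kstest(x, 'uniform', args=(xmin, xmax - xmin))
```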

***Figure 38.*** _Kolmogorov-Smirnov versus the CSR using only the 'x' coordinate._

We can also input a density function as a covariate, to estimate overall spatial randomness:

***Figure 39.*** _Kolmogorov-Smirnov versus the CSR in two dimensions._

We can see that the observed and expected (if our data were CSR) cumulative distributions are pretty different.

Conclusions

Thus, our tests of spatial randomness were performed in order to compare the distribution of our datapoints to the distribution of a Poisson Process, where points have the same likelihood of being found at any location. Both the χ2 test and the Monte-Carlo test suggest that the distribution of our datapoints is significantly different from a Poisson Process, which is also supported by the Kolmogorov-Smirnov test. In a way, these results confirm the obvious point that our data is not randomly distributed over the whole surface of the Earth!

In what way is our data distributed non-randomly?

Our point process is not random, but how? Are the datapoints regularly spaced, or clustered?

We will compare the distances between datapoints; these distances can be of several kinds:

  • Nearest neighbor distances: G function
  • Empty space distances: F function
  • Pairwise distance: K function
G function
Function value object (class 'fv')
for the function r -> G(r)
.....................................................................
        Math.label      Description                                  
r       r               distance argument r                          
theo    G[pois](r)      theoretical Poisson G(r)                     
han     hat(G)[han](r)  Hanisch estimate of G(r)                     
rs      hat(G)[bord](r) border corrected estimate of G(r)            
km      hat(G)[km](r)   Kaplan-Meier estimate of G(r)                
hazard  hat(h)[km](r)   Kaplan-Meier estimate of hazard function h(r)
theohaz h[pois](r)      theoretical Poisson hazard function h(r)     
.....................................................................
Default plot formula:  .~r
where "." stands for 'km', 'rs', 'han', 'theo'
Recommended range of argument r: [0, 8.324]
Available range of argument r: [0, 30.226]
***Figure 40.*** _The G function._

For all values of r (the distance between points), our empirical function lies above the theoretical one: nearest-neighbour distances among our data points are shorter than expected under a Poisson process, indicating a clustered pattern.
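The logic of this comparison can be sketched as follows (a plain Python version of the G function that ignores the edge corrections included in spatstat's estimates):

```python
import numpy as np

def g_function(points, r):
    """Empirical G(r): cumulative distribution of nearest-neighbour
    distances, evaluated at the radii in r."""
    pts = np.asarray(points, dtype=float)
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)              # exclude self-distances
    nnd = d.min(axis=1)                      # nearest-neighbour distances
    g_emp = np.array([(nnd <= ri).mean() for ri in r])
    return g_emp, nnd

def g_theo(lam, r):
    """Theoretical Poisson G(r) = 1 - exp(-lambda * pi * r^2)."""
    return 1.0 - np.exp(-lam * np.pi * np.asarray(r) ** 2)
```

A clustered pattern shows `g_emp` above `g_theo`, exactly as observed in Figure 40.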

F function
Function value object (class 'fv')
for the function r -> F(r)
.....................................................................
        Math.label      Description                                  
r       r               distance argument r                          
theo    F[pois](r)      theoretical Poisson F(r)                     
cs      hat(F)[cs](r)   Chiu-Stoyan estimate of F(r)                 
rs      hat(F)[bord](r) border corrected estimate of F(r)            
km      hat(F)[km](r)   Kaplan-Meier estimate of F(r)                
hazard  hat(h)[km](r)   Kaplan-Meier estimate of hazard function h(r)
theohaz h[pois](r)      theoretical Poisson hazard h(r)              
.....................................................................
Default plot formula:  .~r
where "." stands for 'km', 'rs', 'cs', 'theo'
Recommended range of argument r: [0, 30.281]
Available range of argument r: [0, 30.281]
***Figure 41.*** _The F function._

Empirical values below the theoretical curve (blue) indicate that empty space distances in our empirical point pattern are longer than for a Poisson process, again suggesting a clustered pattern.

K function
Function value object (class 'fv')
for the function r -> K(r)
..............................................................
       Math.label       Description                           
r      r                distance argument r                   
theo   K[pois](r)       theoretical Poisson K(r)              
border hat(K)[bord](r)  border-corrected estimate of K(r)     
trans  hat(K)[trans](r) translation-corrected estimate of K(r)
iso    hat(K)[iso](r)   isotropic-corrected estimate of K(r)  
..............................................................
Default plot formula:  .~r
where "." stands for 'iso', 'trans', 'border', 'theo'
Recommended range of argument r: [0, 26.916]
Available range of argument r: [0, 26.916]
***Figure 42.*** _The K function._

This test confirms the previous results: our data seems to be clustered.

Conclusions

Thus, the G function (nearest neighbour distance), F function (empty space distance) and K function (pairwise distance) plots all show a clustered pattern.

Elevation as a covariate

We test here whether elevation is driving the location of our datapoints; we create two fitted models for our point process: one with a plain Poisson Process, and one with a Poisson Process plus elevation as a covariate.

***Figure 43.*** _Elevation as a covariate._

We use the function ppm (point process model) to fit a model (this is analogous to the model fitting functions in R such as lm and glm). The trend \(S(u)\) is specified by an R language formula, like the formulas used to specify the systematic relationship in a linear model or generalised linear model.

Nonstationary Poisson process

Log intensity:  ~side(x)

Fitted trend coefficients:
 (Intercept) side(x)right 
  -4.8026460   -0.2792705 

               Estimate       S.E.   CI95.lo   CI95.hi Ztest        Zval
(Intercept)  -4.8026460 0.02207554 -4.845913 -4.759379   *** -217.555092
side(x)right -0.2792705 0.03364014 -0.345204 -0.213337   ***   -8.301703

We plot the predictions of the fitting functions:

  1. Poisson model with elevation as a covariate
  2. Poisson model fitted with intensity that is proportional to elevation

***Figure 44.*** _Poisson model with elevation as a covariate._


***Figure 45.*** _Poisson model fitted with intensity that is proportional to elevation._


The summary of the fitting function with elevation as a covariate:

Point process model
Fitting method: maximum likelihood (Berman-Turner approximation)
Model was fitted using glm()
Algorithm converged
Call:
ppm.ppp(Q = mydatappp2, trend = ~cov1, covariates = list(cov1 = ele))
Edge correction: "border"
    [border correction distance r = 0 ]
--------------------------------------------------------------------------------
Quadrature scheme (Berman-Turner) = data + dummy + weights

Data pattern:
Planar point pattern:  142 points
Average intensity 0.00372 points per square unit
Window: rectangle = [-176.207, 178.33] x [-38.2881, 69.3761] units
                    (354.5 x 107.7 units)
Window area = 38170.9 square units

Dummy quadrature points:
     32 x 32 grid of dummy points, plus 4 corner points
     dummy spacing: 11.079281 x  3.364506 units

Original dummy parameters:
Planar point pattern:  1028 points
Average intensity 0.0269 points per square unit
Window: rectangle = [-176.207, 178.33] x [-38.2881, 69.3761] units
                    (354.5 x 107.7 units)
Window area = 38170.9 square units
Quadrature weights:
     (counting weights based on 32 x 32 array of rectangular tiles)
All weights:
    range: [5.33, 37.3] total: 38200
Weights on data points:
    range: [5.33, 18.6] total: 2030
Weights on dummy points:
    range: [5.33, 37.3] total: 36100
--------------------------------------------------------------------------------
FITTED MODEL:

Nonstationary Poisson process

---- Intensity: ----

Log intensity: ~cov1
Model depends on external covariate 'cov1'
Covariates provided:
    cov1: im

Fitted trend coefficients:
 (Intercept)         cov1 
-5.269965833  0.000493119 

                Estimate         S.E.       CI95.lo       CI95.hi Ztest
(Intercept) -5.269965833 8.704763e-02 -5.4405760603 -5.0993556058   ***
cov1         0.000493119 4.080347e-05  0.0004131457  0.0005730923   ***
                 Zval
(Intercept) -60.54117
cov1         12.08522

----------- gory details -----

Fitted regular parameters (theta):
 (Intercept)         cov1 
-5.269965833  0.000493119 

Fitted exp(theta):
(Intercept)        cov1 
0.005143786 1.000493241 

We compared the models using the likelihood ratio test of the null hypothesis of a homogeneous Poisson process (CSR) against the alternative of an inhomogeneous Poisson process with intensity that is a function of the elevation covariate.

Analysis of Deviance Table
Npar Df Deviance Pr(>Chi)
1 NA NA NA
2 1 171.4 3.629e-39

The p-value is very small, rejecting CSR in favour of the alternative model.
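As a sanity check on the arithmetic, the p-value for a deviance of 171.4 on 1 degree of freedom can be recovered from the χ² survival function, which for 1 df reduces to erfc(√(deviance/2)). A minimal Python sketch (the test itself was run with R's anova machinery):

```python
import math

def chisq_sf_1df(deviance):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom.

    X = Z^2 with Z standard normal, so P(X > x) = 2*P(Z > sqrt(x)) = erfc(sqrt(x/2)).
    """
    return math.erfc(math.sqrt(deviance / 2))

# Deviance reported in the Analysis of Deviance table above
p = chisq_sf_1df(171.4)
print(p)  # ~3.6e-39, in agreement with Pr(>Chi) above
```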

Spatial autocorrelation

Spatial autocorrelation describes the degree to which variables are similar to each other at different spatial locations. We computed Moran’s index (Moran, 1950), which is a measure of global spatial autocorrelation, using different distance matrices.

We will use 3 different methods to compute spatial autocorrelation with Moran’s index:

  • inverse distance matrix
  • bins of distance (above or below a given lag)
  • nearest neighbour
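The first of these methods can be sketched in a few lines. The following is an illustrative pure-Python version of Moran's I with inverse-distance weights on a toy one-dimensional example (the analysis itself was run in R; real data use geographic distances between populations):

```python
def morans_i(values, coords):
    """Moran's I with inverse-distance weights w_ij = 1 / d_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num, s0 = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = abs(coords[i] - coords[j])  # 1-D toy distance; use great-circle distance for real data
            w = 1.0 / d
            num += w * dev[i] * dev[j]
            s0 += w
    denom = sum(d * d for d in dev)
    return (n / s0) * (num / denom)

# Two tight pairs far apart, with matching values: a clustered pattern,
# so I should exceed its expectation E[I] = -1/(n-1)
i = morans_i([0, 0, 1, 1], [0.0, 1.0, 10.0, 11.0])
expected = -1 / (4 - 1)
print(i, expected)  # I ~ 0.67 > -0.33: positive spatial autocorrelation
```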

Inverse distance matrix

Moran’s I for daltonism:

$observed
[1] 0.1760609

$expected
[1] -0.007092199

$sd
[1] 0.02082568

$p.value
[1] 0

Moran’s I for Blue:

$observed
[1] 0.2024622

$expected
[1] -0.007092199

$sd
[1] 0.02093014

$p.value
[1] 0

Thus, there is a significant positive spatial autocorrelation for both Blue and daltonism.

Inverse distance matrix with bins of distance

Instead of using the whole distance matrix, we will use bins of distance. For example, populations living within a certain distance range (such as between 0 and 1000 km of each other) will be considered close, while the others will be considered far. The information conveyed in the distance matrix is thus converted into binary information.

We will use 2 different ranges:

  • between 0 and 1000 km, and
  • between 0 and 2000 km.

Thus, the average number of neighbours per population increases with the range.
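The conversion itself is a simple thresholding of the distance matrix. A hypothetical Python sketch on a toy 3-population distance matrix (in km):

```python
def bin_weights(dist, threshold):
    """Binary spatial weights: 1 if the pair is within `threshold` km, else 0."""
    n = len(dist)
    return [[1 if i != j and dist[i][j] <= threshold else 0 for j in range(n)]
            for i in range(n)]

def mean_neighbours(w):
    """Average number of neighbours per population."""
    return sum(map(sum, w)) / len(w)

# Toy symmetric distance matrix (km) for 3 populations
dist = [[0, 800, 1500],
        [800, 0, 1900],
        [1500, 1900, 0]]

# A larger range yields more neighbours per population on average
print(mean_neighbours(bin_weights(dist, 1000)))  # 2/3: only the 800 km pair qualifies
print(mean_neighbours(bin_weights(dist, 2000)))  # 2.0: every pair qualifies
```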

Moran’s I for daltonism, bins=2000 and bins=1000:

$observed
[1] 0.3764329

$expected
[1] -0.007092199

$sd
[1] 0.0445232

$p.value
[1] 0
$observed
[1] 0.4416968

$expected
[1] -0.007092199

$sd
[1] 0.06377459

$p.value
[1] 1.962652e-12

Moran’s I for Blue, bins=2000 and bins=1000:

$observed
[1] 0.3244207

$expected
[1] -0.007092199

$sd
[1] 0.04474737

$p.value
[1] 1.276756e-13
$observed
[1] 0.3694625

$expected
[1] -0.007092199

$sd
[1] 0.06409501

$p.value
[1] 4.229832e-09

We can see that the results with the two different bins are very similar, showing a significant positive autocorrelation, and very close to the results from the inverse distance matrix.

Nearest Neighbour with Delaunay distance

For each language, we compute its nearest neighbours:

id glottocode long lat 1 2 3 4 5 6 7 8 9 10 11
1 abua1244 6.61492 4.83057 49 99 139 NA NA NA NA NA NA NA NA
2 acol1236 32.51470 3.57738 13 55 63 73 74 96 NA NA NA NA NA
3 ainu1240 142.46167 43.63365 24 56 68 91 141 NA NA NA NA NA NA
4 alba1267 20.00000 41.00000 21 44 51 54 110 NA NA NA NA NA NA
5 alge1239 3.23033 35.42080 49 76 87 122 NA NA NA NA NA NA NA
6 amha1245 39.54346 11.70818 13 63 119 NA NA NA NA NA NA NA NA

Moran’s I for daltonism:


    Moran I test under normality

data:  dbdalto$daltonism  
weights: gg2    

Moran I statistic standard deviate = 7.478, p-value = 3.773e-14
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.414876086      -0.007092199       0.003184125 

Moran’s I for Blue:


    Moran I test under normality

data:  as.numeric(dbdalto$Blue)  
weights: gg2    

Moran I statistic standard deviate = 5.5324, p-value = 1.58e-08
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.305088031      -0.007092199       0.003184125 

Monte Carlo test for daltonism:


    Monte-Carlo simulation of Moran I

data:  dbdalto$daltonism 
weights: gg2  
number of simulations + 1: 100 

statistic = 0.41488, observed rank = 100, p-value = 0.01
alternative hypothesis: greater

Monte Carlo test for Blue:


    Monte-Carlo simulation of Moran I

data:  as.numeric(dbdalto$Blue) 
weights: gg2  
number of simulations + 1: 100 

statistic = 0.30509, observed rank = 100, p-value = 0.01
alternative hypothesis: greater

UVR and Number of color categories

UVB and Number of color categories are two variables of interest in our dataset; let’s have a look at their spatial autocorrelation:

Moran’s I for UVR:

$observed
[1] 0.3912663

$expected
[1] -0.007092199

$sd
[1] 0.020865

$p.value
[1] 0

Moran’s I for number of color categories:

$observed
[1] 0.1733223

$expected
[1] -0.007092199

$sd
[1] 0.0208884

$p.value
[1] 0

These variables also show a high positive spatial autocorrelation, especially UVR.

Conclusions

Multiple methods (inverse distance matrix, bins of distances, and Delaunay neighbours) emphasize the high degree of spatial autocorrelation for Blue and daltonism.

Correlograms: neighbouring coefficient

Here, we use the distance matrices explained above to compute the autocorrelations of the variables Blue and daltonism, as we aim to find which distance matrix results in the highest autocorrelations.

The number of neighbours is manually fixed at 10:

Steps used to compute the correlation coefficient (method 1):

  • for each distance matrix, create a table with the K nearest neighbours (K = 1:10)
  • for each distance matrix, create a table where each row represents a population: the first column contains the values for daltonism (or Blue) for that particular population, while the other columns contain the values of daltonism (or Blue) for the neighbours of this population
  • compute Pearson’s correlation between the focus column and each neighbour column

Thus, the value for KNN=i represents the correlation between the focus population and only the values of the neighbours i steps away.

We also computed (method 2) the correlation between the focus populations and all its neighbours closer than KNN=i. For example, for KNN=5, the correlation is computed between the focus population and the mean of all its neighbours that are included in the 5 steps radius (KNN=1,2,3,4 and 5). This method is probably more appropriate for our analysis here, as we want to study neighbours in general.
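The two methods differ only in what is correlated against the focus column. A toy Python sketch of the two correlation schemes described above (the analysis itself was carried out in R, on the K-nearest-neighbour tables built from each distance matrix):

```python
def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def method1(table, k):
    """Correlation between the focus column and the k-th neighbour only."""
    return pearson([r[0] for r in table], [r[k] for r in table])

def method2(table, k):
    """Correlation between the focus column and the mean of neighbours 1..k."""
    means = [sum(r[1:k + 1]) / k for r in table]
    return pearson([r[0] for r in table], means)

# Toy table: column 0 = focus population, columns 1..3 = its 3 nearest neighbours
table = [[0.1, 0.2, 0.1, 0.9],
         [0.8, 0.7, 0.9, 0.2],
         [0.3, 0.4, 0.2, 0.6],
         [0.9, 0.8, 1.0, 0.1]]
print(method1(table, 1))  # close neighbours resemble the focus: r near 1
print(method2(table, 3))  # averaging over the first 3 neighbours
```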

For daltonism

***Figure 46.*** _Method 1, for daltonism._


***Figure 47.*** _Method 2, for daltonism._


We used a t-test to compare the means of the two autocorrelations:

Welch Two Sample t-test: df.all.mean$corr[df.all.mean$condition == "Genetic_ult"] and df.all.mean$corr[df.all.mean$condition == "Geographic"]

Test statistic   df      P value     Alternative hypothesis   mean of x   mean of y
2.247            9.557   0.04964 *   two.sided                0.5349      0.4982

For Blue

We apply the same steps to the Blue variable (converted from yes/no into a numerical binary variable 0/1).

***Figure 48.*** _Method 1, for blue_


***Figure 49.*** _Method 2, for blue_


Welch Two Sample t-test: df.all.mean$corr[df.all.mean$condition == "Genetic_ult"] and df.all.mean$corr[df.all.mean$condition == "Geographic"]

Test statistic   df     P value       Alternative hypothesis   mean of x   mean of y
4.087            11.5   0.001641 **   two.sided                0.5468      0.4471

Conclusions

Thus, we found a high neighbouring coefficient for the Blue (r=0.62 with KNN=3 and the genetic distance matrix) and daltonism (r=0.57 with KNN=3 and the genetic distance matrix) variables, supporting the high spatial autocorrelation found for these variables.

Moreover, the geographic and genetic distance matrices were compared in terms of which generates the highest neighbouring coefficient. Interestingly, the variants of genetic distance (no data imputation, additive imputation, ultrametric imputation) and the phylogenetic distance (from Glottolog) gave very different results. When the number of nearest neighbours is < 8, the genetic distance with ultrametric imputation best explains the patterning of our data points, for both Blue and daltonism. In particular, it resulted in significantly higher neighbouring coefficients than geographical distance. When the number of nearest neighbours is >= 9, all distance matrices tend to produce equivalent rates of similarity.

Hypothesis 1: does the number of color categories predict the presence of a word for “blue”?

Exploratory analysis

We first investigated if the number of color categories may predict variable Blue. Indeed, according to Berlin & Kay (1991), color categories are acquired through successive steps. Thus, a language possessing 6 color words will possess a word for blue, whereas a language having only 5 color terms will not. If this theory is correct, then the number of color categories should entirely predict the presence or absence of a specific blue term. However, several authors have noticed that it does not account for all languages.

In our dataset, only 60.56% of languages follow Berlin and Kay’s stages. The distributions of the number of color categories for languages with and without a word for Blue overlap:

***Figure 50.*** _The distribution of the number of color categories for language with (blue) and without (red) a word for Blue._


In our data, 4.9% of the languages with fewer than 6 color words do have a word for blue, while 20.8% of languages with more than 6 color words do not have a word for blue.

Thus, for our data, while the number of color categories is indeed an important predictor of the presence of a word for blue (as predicted by Berlin and Kay’s stages theory), it does not, however, account for all the variation in our dataset.
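The consistency check above amounts to asking, for each language, whether "has a word for blue" coincides with "has at least 6 color terms". A hypothetical Python sketch on toy data (the threshold of 6 follows the simplified statement of the staged theory above):

```python
def follows_berlin_kay(n_terms, has_blue, stage_threshold=6):
    """A language fits the staged prediction when `has_blue` is True exactly
    for languages with at least `stage_threshold` color terms."""
    return has_blue == (n_terms >= stage_threshold)

# (number of color categories, has a word for blue) for a handful of toy languages
langs = [(5, False), (7, True), (5, True), (8, False), (11, True)]
share = sum(follows_berlin_kay(n, b) for n, b in langs) / len(langs)
print(share)  # 0.6: three of the five toy languages fit the staged prediction
```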

Mediation analysis

Second, we carried out a mediation analysis (using the brm function from the brms package in R: Bürkner, 2018a, and the mediation function from the bayestestR package) to test whether the number of color categories is a mediating factor between UVR incidence and the presence of a dedicated word for blue, and whether the relation between the latter and latitude is mediated by UVR incidence.

Due to current limitations in the brm function, we cannot study the quadratic effect of latitude, only the linear one.

Mediation path diagrams.

Interpretation of the mediation output:

  • Indirect effect = the indirect effect of X on Y that goes through the mediator M, computed as a*b.
  • Direct effect = the direct effect of X on Y, represented as c'.
  • Total effect = the total (direct + indirect) effect of X on Y, represented as c and computed as c = c' + a*b.
  • Prop. mediated = the proportion of the effect of X on Y that goes through M.
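Concretely, with the point estimates reported in the first mediation analysis (direct effect c' = -0.97, indirect effect a*b = -0.83), the total effect and the proportion mediated follow from the identities above:

```python
def mediation_summary(direct, indirect):
    """Combine path estimates: total = c' + a*b; proportion mediated = a*b / total."""
    total = direct + indirect
    return total, indirect / total

# Point estimates from the UVR -> number of color categories -> Blue analysis
total, prop = mediation_summary(direct=-0.97, indirect=-0.83)
print(total, prop)  # -1.80 and ~0.461, i.e. ~46% of the effect is mediated
```

(The proportions reported by the mediation output are computed over the full posterior, so they differ marginally from this point-estimate arithmetic.)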

UVR → Number of color categories → Blue

Bayes model result and diagnostic
 Family: MV(gaussian, bernoulli) 
  Links: mu = identity; sigma = identity
         mu = logit 
Formula: Nb_color_cat ~ UVB_z + (1 | macroarea) + (1 | family_code) 
         Blue ~ UVB_z + Nb_color_cat + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Nbcolorcat_Intercept)     1.22      0.36     0.59     2.01 1.00    10727
sd(Blue_Intercept)           0.60      0.50     0.02     1.86 1.00    11097
                         Tail_ESS
sd(Nbcolorcat_Intercept)    11104
sd(Blue_Intercept)          11768

~macroarea (Number of levels: 6) 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Nbcolorcat_Intercept)     0.89      0.76     0.04     2.81 1.00    10098
sd(Blue_Intercept)           0.84      0.93     0.02     3.25 1.00     9632
                         Tail_ESS
sd(Nbcolorcat_Intercept)    11163
sd(Blue_Intercept)          10203

Population-Level Effects: 
                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Nbcolorcat_Intercept     6.99      0.62     5.78     8.19 1.00    11649
Blue_Intercept          -6.25      1.33    -9.02    -3.78 1.00    11637
Nbcolorcat_UVB_z        -0.92      0.26    -1.43    -0.42 1.00    11380
Blue_UVB_z              -0.98      0.36    -1.72    -0.31 1.00    11968
Blue_Nb_color_cat        0.92      0.17     0.62     1.28 1.00    12038
                     Tail_ESS
Nbcolorcat_Intercept    11487
Blue_Intercept          11724
Nbcolorcat_UVB_z        11468
Blue_UVB_z              12142
Blue_Nb_color_cat       11606

Family Specific Parameters: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_Nbcolorcat     2.19      0.14     1.93     2.48 1.00    11725    11052

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Mediation result

With prob = 0.9:


# Causal Mediation Analysis for Stan Model

  Treatment: UVB_z
   Mediator: Nb_color_cat
   Response: Blue

                 Estimate     HDI (90%)
  Direct effect:    -0.97 [-1.55 -0.38]
Indirect effect:    -0.83 [-1.31 -0.38]
   Total effect:    -1.80 [-2.61 -1.07]

Proportion mediated: 46.14% [25.42% 66.86%]

With prob = 0.95:


# Causal Mediation Analysis for Stan Model

  Treatment: UVB_z
   Mediator: Nb_color_cat
   Response: Blue

                 Estimate     HDI (95%)
  Direct effect:    -0.97 [-1.66 -0.26]
Indirect effect:    -0.83 [-1.44 -0.33]
   Total effect:    -1.80 [-2.78 -0.94]

Proportion mediated: 46.14% [20.99% 71.28%]

Please note that as Blue (outcome) is binary, direct and indirect effects may be on different scales.

Comparison of predicted and observed values for the mediator and outcome

The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.

The faint lines (yrep) are posterior predictive draws.

Latitude → UVR → Blue

Bayes model result and diagnostic
 Family: MV(gaussian, bernoulli) 
  Links: mu = identity; sigma = identity
         mu = logit 
Formula: UVB_z ~ lat_rescaled + (1 | macroarea) + (1 | family_code) 
         Blue ~ UVB_z + lat_rescaled + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(UVBz_Intercept)     0.13      0.07     0.01     0.27 1.00     9076     9369
sd(Blue_Intercept)     1.00      0.74     0.05     2.83 1.00     8537    10554

~macroarea (Number of levels: 6) 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(UVBz_Intercept)     0.14      0.13     0.01     0.45 1.00     9205    10401
sd(Blue_Intercept)     1.53      1.35     0.09     5.14 1.00    10311    10830

Population-Level Effects: 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
UVBz_Intercept       -4.69      0.23    -5.16    -4.25 1.00    11617    11602
Blue_Intercept       -6.94      4.16   -15.28     1.04 1.00    11893    11788
UVBz_lat_rescaled     5.60      0.24     5.15     6.08 1.00    11785    11690
Blue_UVB_z           -2.55      0.82    -4.24    -1.03 1.00    11715    11688
Blue_lat_rescaled     8.23      4.79    -0.96    17.85 1.00    11901    11766

Family Specific Parameters: 
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_UVBz     0.35      0.02     0.31     0.39 1.00    11987    11929

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Mediation result

With prob = 0.9:


# Causal Mediation Analysis for Stan Model

  Treatment: lat_rescaled
   Mediator: UVB_z
   Response: Blue

                 Estimate      HDI (90%)
  Direct effect:     8.15 [  0.40 16.16]
Indirect effect:   -14.07 [-21.76 -6.62]
   Total effect:    -5.90 [ -9.65 -2.42]

Proportion mediated: 238.42% [40.55% 436.29%]

With prob = 0.95:


# Causal Mediation Analysis for Stan Model

  Treatment: lat_rescaled
   Mediator: UVB_z
   Response: Blue

                 Estimate     HDI (95%)
  Direct effect:     8.15 [ -1.0 17.81]
Indirect effect:   -14.07 [-23.3 -5.36]
   Total effect:    -5.90 [-10.5 -1.93]

Proportion mediated: 238.42% [-28.88% 505.72%]

Please note that as Blue (outcome) is binary, direct and indirect effects may be on different scales.

Comparison of predicted and observed values for the mediator and outcome

The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.

Putative latitude of language family origin → UVR → Blue

Bayes model result and diagnostic
 Family: MV(gaussian, bernoulli) 
  Links: mu = identity; sigma = identity
         mu = logit 
Formula: UVB_z ~ lat_fam_rescaled + (1 | macroarea) 
         Blue ~ UVB_z + lat_fam_rescaled + (1 | macroarea) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~macroarea (Number of levels: 6) 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(UVBz_Intercept)     0.35      0.22     0.10     0.91 1.00    10684    11489
sd(Blue_Intercept)     1.83      1.36     0.37     5.39 1.00     9506    10149

Population-Level Effects: 
                      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
UVBz_Intercept           -2.78      0.42    -3.58    -1.95 1.00    11599
Blue_Intercept           -2.78      2.18    -7.04     1.54 1.00    11883
UVBz_lat_fam_rescaled     3.42      0.41     2.60     4.22 1.00    11675
Blue_UVB_z               -1.56      0.40    -2.38    -0.83 1.00    12187
Blue_lat_fam_rescaled     3.45      2.29    -1.03     8.13 1.00    12344
                      Tail_ESS
UVBz_Intercept           11319
Blue_Intercept           11788
UVBz_lat_fam_rescaled    11206
Blue_UVB_z               11842
Blue_lat_fam_rescaled    12213

Family Specific Parameters: 
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_UVBz     0.71      0.04     0.63     0.80 1.00    11857    11534

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Mediation result

With prob = 0.9:


# Causal Mediation Analysis for Stan Model

  Treatment: lat_fam_rescaled
   Mediator: UVB_z
   Response: Blue

                 Estimate     HDI (90%)
  Direct effect:     3.42 [-0.29  7.17]
Indirect effect:    -5.22 [-7.77 -2.88]
   Total effect:    -1.83 [-4.84  1.14]

Proportion mediated: 285.68% [-1069.29% 1640.65%]

With prob = 0.95:


# Causal Mediation Analysis for Stan Model

  Treatment: lat_fam_rescaled
   Mediator: UVB_z
   Response: Blue

                 Estimate     HDI (95%)
  Direct effect:     3.42 [-1.19  7.90]
Indirect effect:    -5.22 [-8.37 -2.53]
   Total effect:    -1.83 [-5.56  1.51]

Proportion mediated: 285.68% [-2377.77% 2949.13%]

Please note that as Blue (outcome) is binary, direct and indirect effects may be on different scales.

Comparison of predicted and observed values for the mediator and outcome

The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.

Conclusions

The mediation analysis shows that UVR has both a direct and an indirect effect on Blue, the latter mediated by the number of color categories (when having macroarea and family as random effects).

On the other hand, latitude (and latitude2) does have an effect on Blue, but this is fully mediated by UVR incidence (when having macroarea and family as random effects).

Regression models

UVR incidence only

We model here the effects of UVR incidence on Blue using Bayesian mixed-effects models (implemented in the brms package; Bürkner, 2018b). Because the distribution of populations is uneven with respect to geographical location and language family, we used language family and macroarea as random effects in all our models.

 Family: bernoulli 
  Links: mu = logit 
Formula: Blue ~ UVB_z + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.14      0.83     0.06     3.14 1.00     7686    10150

~macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.52      1.31     0.08     4.96 1.00     9483    10686

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -0.02      0.95    -1.89     1.97 1.00    11483    11194
UVB_z        -1.33      0.37    -2.14    -0.68 1.00    10917    11068

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval

Parameter |        95% HDI
--------------------------
Intercept | [-1.99,  1.85]
UVB_z     | [-2.08, -0.63]
# Proportion of samples inside the ROPE [-0.01, 0.01]:

Parameter | inside ROPE
-----------------------
Intercept |      1.25 %
UVB_z     |      0.00 %
    Parameter ROPE_low ROPE_high p_ROPE Effects   Component
1 b_Intercept    -0.01      0.01  0.012   fixed conditional
2     b_UVB_z    -0.01      0.01  0.000   fixed conditional

We also tried random slopes, but they did not make any significant contribution, and we retained the random-intercepts-only model (Bayes factor 0.02, very strong evidence for random intercepts only).

UVR has a strong negative effect on Blue (Bayes factor 8963.18, extreme evidence for model with UVR incidence; 95% HDI for βUVR = [-2.08, -0.63]), explaining R2 = 35.73% of the variance, and correctly classifying 73.46% of the populations (recall = 77.57%, precision = 78.22%).
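The classification summary above (73.46% correct, recall = 77.57%, precision = 78.22%) comes from comparing the model's predicted presence of a blue term against the observed one; the metrics are the usual confusion-matrix ratios, sketched here in Python on toy data (not the real model predictions):

```python
def classification_metrics(observed, predicted):
    """Accuracy, recall and precision for binary outcomes (1 = has a blue term)."""
    tp = sum(o == 1 and p == 1 for o, p in zip(observed, predicted))
    fp = sum(o == 0 and p == 1 for o, p in zip(observed, predicted))
    fn = sum(o == 1 and p == 0 for o, p in zip(observed, predicted))
    acc = sum(o == p for o, p in zip(observed, predicted)) / len(observed)
    return acc, tp / (tp + fn), tp / (tp + fp)

# Toy observed/predicted labels
obs = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(classification_metrics(obs, pred))  # (0.75, 0.8, 0.8)
```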

Adding covariates

Using Bayes factors

The best model is:

 Family: bernoulli 
  Links: mu = logit 
Formula: Blue ~ UVB_z + Nb_color_cat + log_dist2lakes + hg + lat_fam_rescaled + lon_fam_rescaled + lat_rescaled + lon_rescaled + I(lat_fam_rescaled^2) + I(lon_fam_rescaled^2) + I(lat_rescaled^2) + I(Nb_color_cat^2) + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     3.03      1.89     0.25     7.56 1.00     8678     9294

~macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     3.40      2.92     0.14    10.93 1.00    10783    11528

Population-Level Effects: 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept             -22.45     22.24   -70.79    18.88 1.00    11852    11360
UVB_z                  -5.20      1.84    -9.10    -1.90 1.00    11887    11621
Nb_color_cat            7.78      2.41     3.58    13.07 1.00    10925    11669
log_dist2lakes         -1.15      0.44    -2.10    -0.39 1.00    11022    11458
hgyes                  -3.42      2.62    -8.91     1.31 1.00    11000    11198
lat_fam_rescaled      -59.36     70.46  -198.99    82.77 1.00    11770    11475
lon_fam_rescaled       -1.80      2.74    -7.63     3.16 1.00    11616    11561
lat_rescaled            8.09     40.79   -74.76    86.81 1.00    11722    11687
lon_rescaled            2.86      2.08    -0.92     7.28 1.00    11511    11438
Ilat_fam_rescaledE2    29.51     47.64   -70.50   122.99 1.00    11720    11488
Ilon_fam_rescaledE2     2.60      4.39    -4.79    12.96 1.00    10876    11311
Ilat_rescaledE2        19.45     26.13   -29.54    73.27 1.00    11343    11114
INb_color_catE2        -0.39      0.14    -0.70    -0.14 1.00    11119    11432

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval

# Fixed Effects (Conditional Model)

Parameter           |           95% HDI
---------------------------------------
Intercept           | [ -67.42,  21.61]
UVB_z               | [  -8.93,  -1.78]
Nb_color_cat        | [   3.29,  12.61]
log_dist2lakes      | [  -2.02,  -0.34]
hgyes               | [  -8.72,   1.47]
lat_fam_rescaled    | [-198.14,  83.52]
lon_fam_rescaled    | [  -7.45,   3.30]
lat_rescaled        | [ -71.20,  89.50]
lon_rescaled        | [  -1.02,   7.18]
Ilat_fam_rescaledE2 | [ -70.56, 122.98]
Ilon_fam_rescaledE2 | [  -5.05,  12.40]
Ilat_rescaledE2     | [ -30.34,  72.22]
INb_color_catE2     | [  -0.68,  -0.13]

# Random Effects (Conditional Model)

Parameter       |           95% HDI
-----------------------------------
Nb_color_cat    | [   3.29,  12.61]
INb_color_catE2 | [  -0.68,  -0.13]
# Proportion of samples inside the ROPE [-0.01, 0.01]:

# Fixed Effects (Conditional Model)

Parameter           | inside ROPE
---------------------------------
Intercept           |      0.03 %
UVB_z               |      0.00 %
Nb_color_cat        |      0.00 %
log_dist2lakes      |      0.00 %
hgyes               |      0.13 %
lat_fam_rescaled    |      0.01 %
lon_fam_rescaled    |      0.43 %
lat_rescaled        |      0.00 %
lon_rescaled        |      0.16 %
Ilat_fam_rescaledE2 |      0.00 %
Ilon_fam_rescaledE2 |      0.27 %
Ilat_rescaledE2     |      0.03 %
INb_color_catE2     |      0.00 %

# Random Effects (Conditional Model)

Parameter       | inside ROPE
-----------------------------
Nb_color_cat    |      0.00 %
INb_color_catE2 |      0.00 %
               Parameter ROPE_low ROPE_high  p_ROPE Effects   Component
7            b_Intercept    -0.01      0.01 2.5e-04   fixed conditional
15               b_UVB_z    -0.01      0.01 0.0e+00   fixed conditional
13        b_Nb_color_cat    -0.01      0.01 0.0e+00  random conditional
14        b_Nb_color_cat    -0.01      0.01 0.0e+00   fixed conditional
10      b_log_dist2lakes    -0.01      0.01 3.3e-04   fixed conditional
1                b_hgyes    -0.01      0.01 1.3e-03   fixed conditional
8     b_lat_fam_rescaled    -0.01      0.01 8.3e-05   fixed conditional
11    b_lon_fam_rescaled    -0.01      0.01 4.1e-03   fixed conditional
9         b_lat_rescaled    -0.01      0.01 0.0e+00   fixed conditional
12        b_lon_rescaled    -0.01      0.01 1.5e-03   fixed conditional
2  b_Ilat_fam_rescaledE2    -0.01      0.01 0.0e+00   fixed conditional
4  b_Ilon_fam_rescaledE2    -0.01      0.01 2.6e-03   fixed conditional
3      b_Ilat_rescaledE2    -0.01      0.01 2.5e-04   fixed conditional
5      b_INb_color_catE2    -0.01      0.01 2.5e-04   fixed conditional
6      b_INb_color_catE2    -0.01      0.01 2.5e-04  random conditional

Multicollinearity

We used the variance inflation factor (VIF) with a threshold of 5, to detect multicollinearity between the covariates in this model.

UVB_z   Nb_color_cat   log_dist2lakes   hg      lat_fam_rescaled   lon_fam_rescaled   lat_rescaled   lon_rescaled
13.73   1.285          1.144            1.455   7.675              5.105              26.9           5.081
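For a predictor j, VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on all the other covariates; with a single competing predictor this reduces to 1/(1 − r²). An illustrative Python sketch with made-up latitude/UVR values (strongly related, as in our data, hence a large VIF):

```python
def vif_two_predictors(x, y):
    """VIF of a predictor with one competitor: 1 / (1 - r^2), r = Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    r = cov / var
    return 1 / (1 - r ** 2)

# Hypothetical, near-collinear latitude and UVR incidence values
lat = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
uvr = [9.8, 9.0, 7.5, 6.0, 4.1, 3.2]
print(vif_two_predictors(lat, uvr))  # well above the threshold of 5
```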

UVR and latitude have the highest VIF scores in our dataset, but, according to our mediation analysis, latitude has an effect on Blue only through UVR. We compared (using both Bayes factors and K-fold) the following four models:

  • model_nolat: UVR as only predictor
  • model_nouvb_quadr: latitude2 as only predictor
  • model_nouvb_lin: latitude as only predictor
  • model_nouvb: latitude and latitude2 as predictors

Bayes factor: model_nolat vs model_nouvb: 0.012; very strong evidence for model_nouvb.
Bayes factor: model_nolat vs model_nouvb_lin: 13.909; strong evidence for model_nolat.
Bayes factor: model_nolat vs model_nouvb_quadr: 6.848; moderate evidence for model_nolat.
            elpd_diff se_diff
model_nolat  0.0       0.0   
model_nouvb -2.0       3.7   
                elpd_diff se_diff
model_nolat      0.0       0.0   
model_nouvb_lin -1.2       3.1   
                  elpd_diff se_diff
model_nouvb_quadr  0.0       0.0   
model_nolat       -1.6       2.6   

With Bayes factors: the model with both latitude and latitude2 outperforms UVR alone, which in turn outperforms latitude2 or latitude taken individually.

With LOO/WAIC/KFOLD: all models seem comparable (the difference between models is smaller than the SE).

Let’s remove UVR from our best complex model:

Bayes factor: model_nolat vs model_nouvb: 650.848; extreme evidence for best model.

… and let’s remove latitude from our best model (quadratic and linear):

Bayes factor: model_nolat vs model_nouvb: 594687.942; extreme evidence for best model.

Thus, we should keep UVR and latitude (and latitude2) as predictors of Blue.

This complex model includes many variables. In order to check which variables are the most important, we computed the best model according to LOO, WAIC and KFOLD methods.

Using WAIC, LOO and KFOLD

With these methods, the best model is:

 Family: bernoulli 
  Links: mu = logit 
Formula: Blue ~ Nb_color_cat + dist2lakes + I(Nb_color_cat^2) + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.90      1.05     0.21     4.27 1.00     8727     9119

~macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.24      1.27     0.04     4.64 1.00     9285     9887

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -22.48      6.61   -36.60   -11.04 1.00    11120    11643
Nb_color_cat        5.28      1.68     2.33     8.85 1.00    10784    11768
dist2lakes         -0.01      0.01    -0.03     0.00 1.00    11870    11763
INb_color_catE2    -0.27      0.10    -0.48    -0.09 1.00    10647    11722

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval

# Fixed Effects (Conditional Model)

Parameter       |          95% HDI
----------------------------------
Intercept       | [-35.65, -10.37]
Nb_color_cat    | [  2.20,   8.64]
dist2lakes      | [ -0.03,   0.00]
INb_color_catE2 | [ -0.47,  -0.09]

# Random Effects (Conditional Model)

Parameter       |          95% HDI
----------------------------------
Nb_color_cat    | [  2.20,   8.64]
INb_color_catE2 | [ -0.47,  -0.09]
# Proportion of samples inside the ROPE [-0.01, 0.01]:

# Fixed Effects (Conditional Model)

Parameter       | inside ROPE
-----------------------------
Intercept       |      0.00 %
Nb_color_cat    |      0.00 %
dist2lakes      |     28.88 %
INb_color_catE2 |      0.00 %

# Random Effects (Conditional Model)

Parameter       | inside ROPE
-----------------------------
Nb_color_cat    |      0.00 %
INb_color_catE2 |      0.00 %
          Parameter ROPE_low ROPE_high  p_ROPE Effects   Component
4       b_Intercept    -0.01      0.01 0.00000   fixed conditional
5    b_Nb_color_cat    -0.01      0.01 0.00000  random conditional
6    b_Nb_color_cat    -0.01      0.01 0.00000   fixed conditional
1      b_dist2lakes    -0.01      0.01 0.29525   fixed conditional
2 b_INb_color_catE2    -0.01      0.01 0.00083   fixed conditional
3 b_INb_color_catE2    -0.01      0.01 0.00083  random conditional
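The "inside ROPE" percentages above are simply the share of posterior draws falling within [-0.01, 0.01]; a sketch with made-up draws:

```python
def prop_in_rope(draws, low=-0.01, high=0.01):
    """Fraction of posterior draws inside the region of practical equivalence."""
    return sum(1 for d in draws if low <= d <= high) / len(draws)

# Hypothetical draws for a weak coefficient (like dist2lakes above):
draws = [-0.03, -0.02, -0.015, -0.009, -0.005, 0.0, 0.004, 0.011, -0.02, -0.012]
print(f"{100 * prop_in_rope(draws):.2f} %")  # 40.00 %
```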

Adding covariates except number of color categories

We use the same approach as before, but this time we remove the number of color categories from the set of potential predictors, as its inclusion is debatable.

Using Bayes factors

With these, the best fit (according to Bayes factors) is very similar to the best fit obtained before:

 Family: bernoulli 
  Links: mu = logit 
Formula: Blue ~ UVB_z + log_popSize + log_dist2lakes + hg + lat_fam_rescaled + lon_fam_rescaled + lat_rescaled + lon_rescaled + I(lat_fam_rescaled^2) + I(lon_fam_rescaled^2) + I(lat_rescaled^2) + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     2.85      1.84     0.34     7.45 1.00     7439     8606

~macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     5.22      3.35     0.70    13.73 1.00     9957     9771

Population-Level Effects: 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -8.06     18.85   -49.26    26.93 1.00    11143    10471
UVB_z                  -4.16      1.47    -7.35    -1.56 1.00    11761    11574
log_popSize             0.54      0.16     0.28     0.88 1.00    10204    11844
log_dist2lakes         -0.84      0.27    -1.43    -0.35 1.00    11447    11488
hgyes                  -2.48      1.84    -6.24     1.07 1.00    11991    11452
lat_fam_rescaled      -33.88     61.82  -152.83    95.30 1.00    11035    10451
lon_fam_rescaled       -1.19      2.23    -5.70     3.03 1.00    11359    11000
lat_rescaled           24.42     35.03   -44.00    93.91 1.00    11121    11587
lon_rescaled            0.97      1.73    -2.23     4.58 1.00    11369    11683
Ilat_fam_rescaledE2    24.38     42.11   -61.80   105.75 1.00    10991    10648
Ilon_fam_rescaledE2     2.26      3.98    -4.07    11.56 1.00    11095    10778
Ilat_rescaledE2        -6.91     21.73   -49.20    36.39 1.00    11113    11597

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval

Parameter           |           95% HDI
---------------------------------------
Intercept           | [ -46.07,  29.33]
UVB_z               | [  -7.06,  -1.42]
log_popSize         | [   0.26,   0.85]
log_dist2lakes      | [  -1.41,  -0.34]
hgyes               | [  -6.07,   1.21]
lat_fam_rescaled    | [-154.20,  92.51]
lon_fam_rescaled    | [  -5.69,   3.04]
lat_rescaled        | [ -44.00,  93.92]
lon_rescaled        | [  -2.24,   4.56]
Ilat_fam_rescaledE2 | [ -58.99, 107.65]
Ilon_fam_rescaledE2 | [  -4.98,  10.44]
Ilat_rescaledE2     | [ -48.32,  37.22]
# Proportion of samples inside the ROPE [-0.01, 0.01]:

Parameter           | inside ROPE
---------------------------------
Intercept           |      0.05 %
UVB_z               |      0.00 %
log_popSize         |      0.00 %
log_dist2lakes      |      0.00 %
hgyes               |      0.19 %
lat_fam_rescaled    |      0.01 %
lon_fam_rescaled    |      0.38 %
lat_rescaled        |      0.04 %
lon_rescaled        |      0.44 %
Ilat_fam_rescaledE2 |      0.03 %
Ilon_fam_rescaledE2 |      0.21 %
Ilat_rescaledE2     |      0.03 %
               Parameter ROPE_low ROPE_high  p_ROPE Effects   Component
5            b_Intercept    -0.01      0.01 5.0e-04   fixed conditional
12               b_UVB_z    -0.01      0.01 0.0e+00   fixed conditional
9          b_log_popSize    -0.01      0.01 0.0e+00   fixed conditional
8       b_log_dist2lakes    -0.01      0.01 1.7e-04   fixed conditional
1                b_hgyes    -0.01      0.01 1.8e-03   fixed conditional
6     b_lat_fam_rescaled    -0.01      0.01 8.3e-05   fixed conditional
10    b_lon_fam_rescaled    -0.01      0.01 3.6e-03   fixed conditional
7         b_lat_rescaled    -0.01      0.01 4.2e-04   fixed conditional
11        b_lon_rescaled    -0.01      0.01 4.2e-03   fixed conditional
2  b_Ilat_fam_rescaledE2    -0.01      0.01 2.5e-04   fixed conditional
4  b_Ilon_fam_rescaledE2    -0.01      0.01 2.0e-03   fixed conditional
3      b_Ilat_rescaledE2    -0.01      0.01 2.5e-04   fixed conditional

Checking the importance of UVR:

Bayes factor: bestmodel2 vs bf_bestmodel2_nouvb: 1265.286; extreme evidence for keeping UVR.

Using WAIC, LOO and KFOLD

With these methods, the best model is:

 Family: bernoulli 
  Links: mu = logit 
Formula: Blue ~ UVB_z + log_popSize + log_dist2lakes + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.34      0.95     0.07     3.62 1.00     7624    10629

~macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     2.94      2.31     0.18     8.93 1.00     8015     8999

Population-Level Effects: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -2.67      2.08    -6.86     1.46 1.00    10972    10700
UVB_z             -1.73      0.51    -2.88    -0.87 1.00     9838    11138
log_popSize        0.42      0.11     0.23     0.67 1.00    10510    11005
log_dist2lakes    -0.52      0.21    -0.97    -0.14 1.00     9695    11313

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval

Parameter      |        95% HDI
-------------------------------
Intercept      | [-6.68,  1.61]
UVB_z          | [-2.79, -0.82]
log_popSize    | [ 0.22,  0.65]
log_dist2lakes | [-0.95, -0.12]
# Proportion of samples inside the ROPE [-0.01, 0.01]:

Parameter      | inside ROPE
----------------------------
Intercept      |      0.14 %
UVB_z          |      0.00 %
log_popSize    |      0.00 %
log_dist2lakes |      0.00 %
         Parameter ROPE_low ROPE_high p_ROPE Effects   Component
1      b_Intercept    -0.01      0.01 0.0013   fixed conditional
4          b_UVB_z    -0.01      0.01 0.0000   fixed conditional
3    b_log_popSize    -0.01      0.01 0.0000   fixed conditional
2 b_log_dist2lakes    -0.01      0.01 0.0011   fixed conditional

Checking the importance of UVR:

Bayes factor: bestmodel2_kfold vs bestmodel2_kfold_nouvb: 19744.72; extreme evidence for keeping UVR.
                       elpd_diff se_diff
bestmodel2_kfold        0.0       0.0   
bestmodel2_kfold_nouvb -3.0       6.1   

Plotting the predictors

Number of color categories. As predicted, this has a strong positive effect on Blue; thus, the more color categories a language has, the more likely it is to have a word for blue.

Population Size. This becomes predictive only if we do not consider the number of color categories (in fact, these two variables are highly positively correlated); thus, the larger the population, the more likely it is to have a word for blue.

***Figure 1.*** _Population size vs having a word for Blue._

UVR. As suggested by Lindsey and Brown, UVR incidence is an important predictor for Blue; the higher the UVR incidence, the less likely a language is to have a word for blue.

***Figure 52.*** _UVR incidence vs having a word for Blue._

***Figure 53.*** _Map of UVR incidence on having a word for Blue._

Latitude. Latitude is highly correlated with UVR incidence, and its effect on Blue is fully mediated by it; the closer to the equator a language is, the less likely it is to have a specific word for blue.

***Figure 2.*** _Latitude on having a word for Blue._

Distance to lakes. It is unclear why the distance to lakes is a predictor for the presence of words for Blue, but it is very probably a proxy for other (unmeasured) variables (see decision trees below); the more distant from lakes a language is, the less likely it is to have a word for blue.

***Figure 55.*** _Map of log(dist2lakes) on having a word for Blue._

***Figure 3.*** _log(dist2lakes) on having a word for Blue._

Latitude and longitude of putative language family origins. Depending on the model selection method used, these are included or not in the set of relevant predictors.

***Figure 57.*** _Map of latitude of language family origins on having a word for Blue._

***Figure 58.*** _Map of longitude of language family origins on having a word for Blue._

***Figure 4.*** _Location of putative origins of language family on having a word for Blue._

HG. Subsistence strategy is one of the weakest predictors, with hunter-gatherers (HG) being less likely to possess a specific term for blue.

***Figure 60.*** _Map of subsistence strategy on having a word for Blue._

         Blue:no  Blue:yes
HG:no         50        78
HG:yes        10         4

Random forests

We used random forests to estimate the relative importance of our variables in predicting the existence of words for Blue.

Random forests are supervised machine-learning algorithms that can be used for both classification and regression. A random forest constructs a multitude of decision trees on bootstrap samples of the training data and collects the prediction from each tree (bagging); it then selects the final prediction by majority vote. Compared to a single decision tree, the random forest structure reduces overfitting on the training data.
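The bagging-and-voting step can be sketched with toy "trees" (hand-written stumps, not fitted models):

```python
from collections import Counter

def forest_predict(trees, x):
    """Bagging: every tree votes and the majority class wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical stumps splitting on different features:
trees = [
    lambda x: x[0] > 0,
    lambda x: x[1] > 5,
    lambda x: x[0] + x[1] > 3,
]
print(forest_predict(trees, [1, 9]))   # True
print(forest_predict(trees, [-1, 2]))  # False
```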

Relative importance of our variables can be measured with two indices: Gini impurity and an accuracy-based measure. While Gini impurity measures the likelihood of incorrectly classifying a randomly chosen new instance, the accuracy-based measure uses the mean decrease in classification accuracy after permuting each variable, averaged over all trees. Because the choice of training/testing split matters, we ran the algorithm several times with random allocations of the training and testing sets and computed the mean variable importance.

However, it should be noted that random forests have trouble modelling shared variance due to hierarchical structure, here in particular due to language family and macroarea.

See https://www.displayr.com/how-is-variable-importance-calculated-for-a-random-forest/ for details:

Gini-based importance: When a tree is built, the decision about which variable to split on at each node uses a calculation of the Gini impurity. For each variable, the Gini decrease is summed across every tree of the forest, every time that variable is chosen to split a node. The sum is divided by the number of trees in the forest to give an average. The scale is irrelevant: only the relative values matter. There is a bias towards using numeric variables to split nodes because there are potentially many split points.
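The Gini decrease at a single split can be computed as follows (toy labels, not our data):

```python
def gini(labels):
    """Gini impurity: probability of misclassifying a random instance
    drawn and labelled according to the class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_decrease(parent, left, right):
    """Impurity reduction achieved by splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = ["blue"] * 5 + ["no_blue"] * 5        # impurity 0.5
left, right = ["blue"] * 5, ["no_blue"] * 5    # pure children
print(gini_decrease(parent, left, right))      # 0.5 (a perfect split)
```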

Accuracy-based importance: Each tree has its testing sample of data that was not used during construction. This sample is used to calculate importance of a specific variable. First, the prediction accuracy on the testing sample is measured. Then, the values of the variable in the testing sample are randomly shuffled, keeping all other variables the same. Finally, the decrease in prediction accuracy on the shuffled data is measured. The mean decrease in accuracy across all trees is reported. This importance measure is also broken down by outcome class. Intuitively, the random shuffling means that, on average, the shuffled variable has no predictive power. This importance is a measure of by how much removing a variable decreases accuracy, and vice versa — by how much including a variable increases accuracy. Note that if a variable has very little predictive power, shuffling may lead to a slight increase in accuracy due to random noise. This in turn can give rise to small negative importance scores, which can be essentially regarded as equivalent to zero importance.
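The accuracy-based (permutation) importance described above can be sketched with a toy classifier (a hand-written rule and hypothetical data, not our fitted forest):

```python
import random

def accuracy(clf, X, y):
    return sum(clf(row) == t for row, t in zip(X, y)) / len(y)

def permutation_importance(clf, X, y, feature, n_shuffles=100, seed=0):
    """Mean decrease in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    base = accuracy(clf, X, y)
    drop = 0.0
    for _ in range(n_shuffles):
        col = [row[feature] for row in X]
        rng.shuffle(col)
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, col)]
        drop += base - accuracy(clf, X_perm, y)
    return drop / n_shuffles

# Toy rule: predict 1 iff feature 0 is positive; feature 1 is pure noise.
clf = lambda row: int(row[0] > 0)
X = [[-2, 5], [-1, 1], [1, 4], [2, 2], [3, 9], [-3, 7]]
y = [0, 0, 1, 1, 1, 0]
print(permutation_importance(clf, X, y, feature=0))  # large (feature matters)
print(permutation_importance(clf, X, y, feature=1))  # exactly 0 (pure noise)
```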

All predictors

The randomForest algorithm correctly classifies 82.82 % of the populations (recall = 86.51 and precision 84.18).
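For reference, the classification rate, recall and precision reported throughout this section are all derived from the confusion matrix of "Blue" predictions; a sketch with made-up counts:

```python
def classification_scores(tp, fp, fn, tn):
    """Classification rate, recall and precision (in %) from a 2x2
    confusion matrix for the positive class 'has a word for Blue'."""
    rate = 100 * (tp + tn) / (tp + fp + fn + tn)
    recall = 100 * tp / (tp + fn)       # share of true positives recovered
    precision = 100 * tp / (tp + fp)    # share of positive calls that are right
    return round(rate, 2), round(recall, 2), round(precision, 2)

# Hypothetical counts (not our confusion matrix):
print(classification_scores(tp=50, fp=10, fn=5, tn=35))  # (85.0, 90.91, 83.33)
```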

***Figure 61.*** _Variable importance according to randomForest algorithm using the Gini index._

***Figure 62.*** _Variable importance according to randomForest algorithm using the accuracy index._

The cForest algorithm correctly classifies 81.55 % of the populations (recall = 85.61 and precision 82.99).

***Figure 63.*** _Variable importance according to cforest algorithm using the accuracy index._

Here, we find again the number of color categories as the most important predictor, followed by UVR incidence, daltonism, population size, latitude and distance to lakes. Thus, we found results relatively similar to those from the regression analysis, but also differences: daltonism was not predictive in the regression models, population size was predictive only when excluding the number of color categories, and the latitude and longitude of language family origins do not appear to be important variables here.

Excluding the number of color categories

The randomForest algorithm correctly classifies 76.04 % of the populations (recall = 77.68 and precision 80.21).

***Figure 64.*** _Variable importance according to randomForest algorithm using the Gini index._

***Figure 65.*** _Variable importance according to randomForest algorithm using the accuracy index._

The cForest algorithm correctly classifies 79.89 % of the populations (recall = 82.78 and precision 82.46).

***Figure 66.*** _Variable importance according to cForest algorithm using the accuracy index._

When the number of color categories is excluded, we find that population size, UVR and daltonism are the most important predictors, followed by latitude and distance to lakes.

Conclusions

While the results from random forests and regression are congruent, especially concerning the importance of UVR and number of color categories/population size, we also found some differences:

  • daltonism and climate_PC1 are variables that have an impact on Blue in the random forest models, but not in the regression models,
  • latitude and longitude of language family origins have an impact on Blue in regression models, but not in the random forest models.

Decision trees

We used simple decision trees as implemented by ctree (Hothorn, Hornik, & Zeileis, 2006).

***Figure 67.*** _Simple decision tree._

To understand what this decision tree does, we plotted which populations have less/more than 0.726 for z-scored UVR, and less/more than 4.7 for log_dist2lakes.

***Figure 68.*** _UVR (z-scored) as split by the decision tree._

***Figure 69.*** _Log of distance to lakes as split by the decision tree._

Thus, it seems that the distance to lakes is relevant only when there are exactly 6 color categories, in which case populations living more than about 25 km (= 24.7) away from the closest lake do not have a specific word for Blue, whereas populations living closer have a 75% chance of possessing such a word. While we could not find a satisfying explanation for this “lake effect”, we suggest some hypotheses (among many others) below:

  • populations living close to lakes more often see the blue color of the lake’s waters. However, oceans and seas are also blue (and there is no “sea/ocean effect” in our data), and not all lakes necessarily have clean, blue water
  • distance to lakes is in fact a proxy for continents: indeed, the map of distance to lakes suggests that populations in parts of South America, equatorial Africa and Papunesia tend to be far away from lakes. However, we did control for macroarea as a random effect
  • distance to lakes is a proxy for economic development: there are suggestions that proximity to large bodies of water (not only lakes) may have favoured historical economic development, but this hypothesis cannot be tested with the currently available data
  • distance to lakes may be a proxy for latitude, but splitting latitude into bins did not support this idea
  • of course, it is entirely possible that the relationship between distance to lakes and having a word for Blue is entirely accidental in our dataset.

Other machine learning techniques

To find out whether there are other relations between the variables in our dataset and Blue that could have been missed by the regression, random forest and decision tree models, we compared their performance with that of Support Vector Machines (SVM; as implemented by the e1071 package).

SVM is a supervised machine learning technique which can be used for both classification and regression. It uses the kernel trick to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.
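A common choice behind the kernel trick is the radial basis function (RBF) kernel, which scores similarity in an implicit high-dimensional feature space without ever constructing it (generic sketch; the gamma value here is arbitrary, not a fitted parameter):

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2): close points score near 1,
    distant points near 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([0, 0], [0, 0]))            # 1.0 (identical points)
print(round(rbf_kernel([0, 0], [1, 1]), 4))  # 0.1353 (= exp(-2))
```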

We randomly split the dataset into a training and a testing set, and we computed the precision, recall and classification rate from the model’s confusion matrix. The training set contains 80% of the dataset, obtained by stratified sampling on the variable Blue. Because our dataset is quite small, the choice of the training and testing sets impacted the results. To mitigate this problem, we ran these algorithms 30 times and computed the mean classification rate, recall and precision.
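The stratified 80/20 split can be sketched as follows (generic pure-Python sketch; the actual analysis was done in R):

```python
import random

def stratified_split(labels, train_frac=0.8, seed=0):
    """Return train/test index lists, keeping ~train_frac of each class
    in the training set (stratified sampling)."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train += idx[:cut]
        test += idx[cut:]
    return train, test

# 'Blue' labels for a hypothetical 20-language sample:
labels = ["yes"] * 10 + ["no"] * 10
train, test = stratified_split(labels)
print(len(train), len(test))  # 16 4
```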

Model with all covariates:

***Figure 70.*** _Mean performance of classification with all covariates for brms, randomForest and SVM algorithms._

Model with only Number of color categories (linear and quadratic) and distance to lakes as covariates:

***Figure 71.*** _Mean performance of classification for brms, randomForest and SVM algorithms for a model including only number of color categories (linear and quadratic) and distance to lakes._

The performance is very similar for all algorithms. The classification accuracy on the testing set is high for the Bayesian mixed-effects model, the random forest and the SVM. Thus, a simple Bayesian mixed-effects model is enough to capture the explanatory power of our variables for predicting the presence of a specific term for blue.


Conclusions

See Plotting the predictors for data visualization, but the main conclusions are:

  • as expected, given Berlin & Kay (1991)’s work, the number of color categories is the most relevant variable for predicting the presence or absence of a specific word for blue in a population, its effect being positive;
  • surprisingly, distance to lakes seems to play an important (negative) role, even though the reasons remain unclear;
  • UVR incidence has an important negative effect on the presence of a word for blue; in fact, its effect remains significant even after controlling for cultural and ecological variables. However, depending on the specific technique and model comparison methods used, the strength of this effect varies somewhat;
  • latitude is highly multicollinear with UVR incidence, and its (positive) impact on the presence of a word for blue is fully mediated by UVR incidence;
  • finally, depending on the methods and the variables added as covariates, population size, daltonism, and the geographic coordinates of putative language family origins may also be relevant predictors for the presence of a word for blue.


Hypothesis 2: does UVR incidence influence the population frequency of abnormal color perception?


Mediation analysis

We then investigated the relation between UVR and daltonism. First, as a preliminary analysis, we conducted a mediation analysis (using the mediation package in R; Tingley, Yamamoto, Hirose, Keele, & Imai, 2014) to test whether UVR is a mediator between latitude and daltonism.

Second, we carried out a mediation analysis (using the brm function from brms package in R: Bürkner, 2018a and the mediation function from spatstat; Baddeley et al., 2015) to test whether UVR is a mediator between latitude and daltonism, and between latitude of language family and daltonism.
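For reading the outputs below: the total effect decomposes as direct + indirect (mediated) effect, and the proportion mediated is indirect/total. A sketch with made-up numbers, which also shows how proportions far outside 0–100% (as in some results below) arise when the direct and indirect effects have opposite signs:

```python
def mediation_summary(direct, indirect):
    """Total effect and proportion mediated from the two components."""
    total = direct + indirect
    return total, indirect / total

# Hypothetical effects with opposite signs:
total, prop = mediation_summary(direct=0.5, indirect=-2.0)
print(total)                 # -1.5
print(f"{100 * prop:.1f}%")  # 133.3% -- outside [0%, 100%]
```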

latitude –> UVR –> daltonism ?

Bayes model result and diagnostic
 Family: MV(gaussian, beta) 
  Links: mu = identity; sigma = identity
         mu = logit; phi = identity 
Formula: UVB_z ~ lat_rescaled + (1 | macroarea) + (1 | family_code) 
         daltonism_beta ~ UVB_z + lat_rescaled + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
                            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(UVBz_Intercept)              0.13      0.07     0.01     0.27 1.00     8957
sd(daltonismbeta_Intercept)     0.30      0.20     0.02     0.76 1.00     7143
                            Tail_ESS
sd(UVBz_Intercept)              9499
sd(daltonismbeta_Intercept)     8804

~macroarea (Number of levels: 6) 
                            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(UVBz_Intercept)              0.14      0.13     0.01     0.45 1.00     9123
sd(daltonismbeta_Intercept)     0.64      0.40     0.15     1.65 1.00     9540
                            Tail_ESS
sd(UVBz_Intercept)             10285
sd(daltonismbeta_Intercept)     9637

Population-Level Effects: 
                           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
UVBz_Intercept                -4.69      0.23    -5.17    -4.25 1.00    10662
daltonismbeta_Intercept       -3.73      1.04    -5.75    -1.68 1.00    11156
UVBz_lat_rescaled              5.60      0.24     5.14     6.09 1.00    10858
daltonismbeta_UVB_z           -0.30      0.18    -0.64     0.07 1.00    11555
daltonismbeta_lat_rescaled    -0.01      1.15    -2.29     2.19 1.00    11170
                           Tail_ESS
UVBz_Intercept                11506
daltonismbeta_Intercept       11765
UVBz_lat_rescaled             11448
daltonismbeta_UVB_z           11648
daltonismbeta_lat_rescaled    11845

Family Specific Parameters: 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_UVBz            0.35      0.02     0.31     0.39 1.00    11376    11572
phi_daltonismbeta    52.21      7.23    39.19    67.41 1.00    10945    11788

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Mediation result

With prob = 0.90:


# Causal Mediation Analysis for Stan Model

  Treatment: lat_rescaled
   Mediator: UVB_z
   Response: daltonismbeta

                 Estimate     HDI (90%)
  Direct effect:     0.00 [-1.99  1.78]
Indirect effect:    -1.66 [-3.34  0.00]
   Total effect:    -1.65 [-2.47 -0.90]

Proportion mediated: 100.41% [-26.19% 227.02%]

With prob = 0.95:


# Causal Mediation Analysis for Stan Model

  Treatment: lat_fam_rescaled
   Mediator: UVB_z
   Response: daltonismbeta

                 Estimate     HDI (95%)
  Direct effect:     0.92 [ 0.04  1.80]
Indirect effect:    -1.24 [-1.88 -0.69]
   Total effect:    -0.34 [-1.10  0.44]

Proportion mediated: 366.93% [-2782.92% 3516.78%]

Comparison of predicted and observed values for the mediator and outcome

The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.

Putative latitude of language family –> UVR –> daltonism ?

Bayes model result and diagnostic
 Family: MV(gaussian, beta) 
  Links: mu = identity; sigma = identity
         mu = logit; phi = identity 
Formula: UVB_z ~ lat_fam_rescaled + (1 | macroarea) 
         daltonism_beta ~ UVB_z + lat_fam_rescaled + (1 | macroarea) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~macroarea (Number of levels: 6) 
                            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(UVBz_Intercept)              0.35      0.22     0.10     0.88 1.00    10623
sd(daltonismbeta_Intercept)     0.75      0.42     0.32     1.81 1.00    10502
                            Tail_ESS
sd(UVBz_Intercept)             11096
sd(daltonismbeta_Intercept)    11353

Population-Level Effects: 
                               Estimate Est.Error l-95% CI u-95% CI Rhat
UVBz_Intercept                    -2.78      0.42    -3.58    -1.95 1.00
daltonismbeta_Intercept           -4.56      0.54    -5.64    -3.54 1.00
UVBz_lat_fam_rescaled              3.42      0.41     2.61     4.23 1.00
daltonismbeta_UVB_z               -0.37      0.08    -0.52    -0.21 1.00
daltonismbeta_lat_fam_rescaled     0.93      0.45     0.06     1.81 1.00
                               Bulk_ESS Tail_ESS
UVBz_Intercept                    11344    11220
daltonismbeta_Intercept           10700    11556
UVBz_lat_fam_rescaled             11362    11198
daltonismbeta_UVB_z               12049    11623
daltonismbeta_lat_fam_rescaled    11422    11973

Family Specific Parameters: 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_UVBz            0.71      0.04     0.63     0.80 1.00    11300    11976
phi_daltonismbeta    50.48      6.56    38.24    63.92 1.00    11904    11806

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Mediation result

With prob = 0.9:


# Causal Mediation Analysis for Stan Model

  Treatment: lat_fam_rescaled
   Mediator: UVB_z
   Response: daltonismbeta

                 Estimate     HDI (90%)
  Direct effect:     0.92 [ 0.20  1.68]
Indirect effect:    -1.24 [-1.75 -0.76]
   Total effect:    -0.34 [-0.97  0.32]

Proportion mediated: 366.93% [-1249.26% 1983.11%]

With prob = 0.95:


# Causal Mediation Analysis for Stan Model

  Treatment: lat_fam_rescaled
   Mediator: UVB_z
   Response: daltonismbeta

                 Estimate     HDI (95%)
  Direct effect:     0.92 [ 0.04  1.80]
Indirect effect:    -1.24 [-1.88 -0.69]
   Total effect:    -0.34 [-1.10  0.44]

Proportion mediated: 366.93% [-2782.92% 3516.78%]

Comparison of predicted and observed values for the mediator and outcome

The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.

Conclusions

The mediation analysis indicates that the effect of latitude (and latitude2) on the population frequency of abnormal color perception is fully mediated by UVR incidence (when macroarea and family are random effects).

Regression models

UVR incidence only

We investigate here the effects of UVR incidence on daltonism.

We use the same methods as in the previous part, namely Bayesian mixed-effects models (implemented in the package brms; Bürkner (2018b)) with language family and macroarea as random effects. As daltonism is a percentage, we used Beta regression.
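brms’ Beta family is parameterized by a mean mu (modelled on the logit scale) and a precision phi; the classical shape parameters are then a = mu·phi and b = (1 − mu)·phi. A quick sketch using the intercept and phi reported below (treated as plain point estimates, ignoring posterior uncertainty):

```python
import math

def beta_shapes(logit_mu, phi):
    """Recover the Beta(a, b) shapes from a logit-scale mean and precision."""
    mu = 1.0 / (1.0 + math.exp(-logit_mu))  # inverse logit
    return mu * phi, (1.0 - mu) * phi

# Intercept -3.74 (logit scale) and phi = 52.41 from the model output below:
a, b = beta_shapes(-3.74, 52.41)
print(round(a / (a + b), 4))  # back-transformed mean: a baseline daltonism
                              # frequency of roughly 2.3%
```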

 Family: beta 
  Links: mu = logit; phi = identity 
Formula: daltonism_beta ~ UVB_z + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 3000; warmup = 500; thin = 1;
         total post-warmup samples = 10000

Group-Level Effects: 
~family_code (Number of levels: 32) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.29      0.19     0.02     0.75 1.00     1336     2436

~macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.63      0.36     0.15     1.57 1.00     2337     1977

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -3.74      0.32    -4.43    -3.11 1.00     3591     3281
UVB_z        -0.29      0.07    -0.45    -0.15 1.00     5088     6282

Family Specific Parameters: 
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi    52.41      7.13    39.40    67.53 1.00     5846     6664

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval

Parameter |        95% HDI
--------------------------
Intercept | [-4.38, -3.07]
UVB_z     | [-0.44, -0.15]
# Proportion of samples inside the ROPE [-0.01, 0.01]:

Parameter | inside ROPE
-----------------------
Intercept |      0.00 %
UVB_z     |      0.00 %
    Parameter ROPE_low ROPE_high p_ROPE Effects   Component
1 b_Intercept    -0.01      0.01      0   fixed conditional
2     b_UVB_z    -0.01      0.01      0   fixed conditional
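The “% inside ROPE” figures above are simply the share of posterior draws falling within [-0.01, 0.01]. A minimal sketch of that computation with hypothetical draws (the real computation uses the draws from the fitted brms model):

```python
import random

# Hypothetical posterior draws for a slope such as b_UVB_z; the real
# computation uses the draws from the fitted brms model.
random.seed(0)
draws = [random.gauss(-0.29, 0.07) for _ in range(10_000)]

rope_low, rope_high = -0.01, 0.01
inside = sum(rope_low <= d <= rope_high for d in draws) / len(draws)
print(f"{100 * inside:.2f} % inside ROPE")
```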

As before, including random slopes did not make any significant contribution (Bayes factor 0.02, very strong evidence for random intercepts only). Consequently, we retained the random intercepts-only model.

The effect of UVR on daltonism is strong and negative:

  • Bayes factor 657.43, extreme evidence for model with UVB incidence; 95% HDI for βUVR = [-0.44, -0.15]
  • K-fold model comparison: evidence for model_uvb: expected log predictive density (ELPD) = -5.19, SE = 4.57

The model with UVR explains R2 = 49.06% of the variance. The Root Mean Square Error of the prediction (RMSE) is 2.18.
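Because the Beta regression models the mean on the logit scale, the coefficients are easiest to read after applying the inverse logit. A small illustration using the posterior means from the summary above:

```python
import math

def inv_logit(x):
    """Map a logit-scale value back to a proportion."""
    return 1 / (1 + math.exp(-x))

intercept, b_uvb = -3.74, -0.29  # posterior means from the summary above

# Expected daltonism rate at average UVB (UVB_z = 0) and one SD above it:
at_mean = inv_logit(intercept)                 # ~0.023, i.e. ~2.3 %
at_plus_one_sd = inv_logit(intercept + b_uvb)  # ~0.017, i.e. ~1.7 %
print(f"{100 * at_mean:.1f} % -> {100 * at_plus_one_sd:.1f} %")
```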

Adding covariates

We added covariates and selected the best model using Bayes factors on the one hand, and WAIC, LOO and K-fold cross-validation on the other.

Using Bayes factors

The best model is:

 Family: beta 
  Links: mu = logit; phi = identity 
Formula: daltonism_beta ~ Blue + lat_rescaled + lat_fam_rescaled + I(lat_rescaled^2) + I(lat_fam_rescaled^2) + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.20      0.17     0.01     0.61 1.00     8208     9223

~macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.55      0.35     0.13     1.44 1.00    10524     9872

Population-Level Effects: 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -3.98      1.59    -7.24    -0.98 1.00    11717    10727
Blueyes                 0.47      0.15     0.18     0.77 1.00    12229    11765
lat_rescaled            0.13      3.94    -7.54     8.01 1.00    11822    11687
lat_fam_rescaled        1.07      4.73    -7.78    11.20 1.00    11000    10269
Ilat_rescaledE2        -1.30      2.68    -6.61     3.92 1.00    11862    11687
Ilat_fam_rescaledE2    -0.08      3.37    -7.22     6.25 1.00    11027    10394

Family Specific Parameters: 
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi    54.74      7.37    41.59    70.11 1.00    11670    11529

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval

Parameter           |        95% HDI
------------------------------------
Intercept           | [-7.28, -1.03]
Blueyes             | [ 0.19,  0.77]
lat_rescaled        | [-7.43,  8.04]
lat_fam_rescaled    | [-8.22, 10.60]
Ilat_rescaledE2     | [-6.62,  3.89]
Ilat_fam_rescaledE2 | [-6.85,  6.57]
# Proportion of samples inside the ROPE [-0.01, 0.01]:

Parameter           | inside ROPE
---------------------------------
Intercept           |      0.00 %
Blueyes             |      0.00 %
lat_rescaled        |      0.18 %
lat_fam_rescaled    |      0.19 %
Ilat_rescaledE2     |      0.34 %
Ilat_fam_rescaledE2 |      0.24 %
              Parameter ROPE_low ROPE_high  p_ROPE Effects   Component
4           b_Intercept    -0.01      0.01 0.00025   fixed conditional
1             b_Blueyes    -0.01      0.01 0.00050   fixed conditional
6        b_lat_rescaled    -0.01      0.01 0.00175   fixed conditional
5    b_lat_fam_rescaled    -0.01      0.01 0.00183   fixed conditional
3     b_Ilat_rescaledE2    -0.01      0.01 0.00325   fixed conditional
2 b_Ilat_fam_rescaledE2    -0.01      0.01 0.00225   fixed conditional

The best model using Bayes factor explains R2 = 51.3% of the variance. The Root Mean Square Error of the prediction (RMSE) is 2.15.

Is HG predictive?

Bayes factor: best model with HG vs best model without HG: 2.165; anecdotal evidence for the model with HG.
              elpd_diff se_diff
d_bestmodel_1  0.0       0.0   
d_bestmodel   -0.7       2.8   

There is anecdotal evidence for the model with HG. The effect is negative but not significant: hunter-gatherer populations tend to have lower rates of color blindness.

According to our mediation analysis, latitude (and latitude²) have an effect on daltonism only through the mediator UVR.

This complex model includes many variables. In order to check which variables are the most important, we computed the best model according to LOO, WAIC and KFOLD methods.

Using WAIC, LOO and KFOLD

With these criteria, the best model is:

 Family: beta 
  Links: mu = logit; phi = identity 
Formula: daltonism_beta ~ Blue + UVB_z + I(lat_fam_rescaled^2) + (1 | macroarea) + (1 | family_code) 
   Data: bNA (Number of observations: 142) 
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
         total post-warmup samples = 12000

Group-Level Effects: 
~family_code (Number of levels: 32) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.15      0.13     0.01     0.51 1.00    10114     9435

~macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.61      0.36     0.20     1.51 1.00    10548    10692

Population-Level Effects: 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -4.40      0.44    -5.28    -3.53 1.00    11607    11225
Blueyes                 0.47      0.15     0.18     0.75 1.00    11636    11009
UVB_z                  -0.27      0.08    -0.43    -0.11 1.00    12113    11646
Ilat_fam_rescaledE2     0.52      0.38    -0.24     1.27 1.00    12210    11608

Family Specific Parameters: 
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi    55.10      7.27    41.77    70.04 1.00    11617    10582

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval

Parameter           |        95% HDI
------------------------------------
Intercept           | [-5.25, -3.50]
Blueyes             | [ 0.19,  0.76]
UVB_z               | [-0.43, -0.11]
Ilat_fam_rescaledE2 | [-0.22,  1.29]
# Proportion of samples inside the ROPE [-0.01, 0.01]:

Parameter           | inside ROPE
---------------------------------
Intercept           |      0.00 %
Blueyes             |      0.00 %
UVB_z               |      0.00 %
Ilat_fam_rescaledE2 |      0.91 %
              Parameter ROPE_low ROPE_high  p_ROPE Effects   Component
3           b_Intercept    -0.01      0.01 0.00000   fixed conditional
1             b_Blueyes    -0.01      0.01 0.00025   fixed conditional
4               b_UVB_z    -0.01      0.01 0.00058   fixed conditional
2 b_Ilat_fam_rescaledE2    -0.01      0.01 0.00867   fixed conditional

The best model using K-fold explains R2 = 51.67% of the variance. The Root Mean Square Error of the prediction (RMSE) is 2.09.

Adding HG:

Bayes factor: best light model with HG vs best light model without HG: 1.686; anecdotal evidence for the light model with HG.
                   elpd_diff se_diff
d_bestlightmodel    0.0       0.0   
d_bestlightmodel_1 -3.7       3.5   

Subsistence mode should not be added to this model.

UVR must be kept:

Bayes factor: best light model vs best light model without UVB: 36.668; very strong evidence for the best light model.
                       elpd_diff se_diff
d_bestlightmodel        0.0       0.0   
d_bestlightmodel_nouvb -3.5       3.2   

Plotting the predictors

UVR. UVR has a significant negative impact on daltonism: the higher the UVR incidence on a population, the fewer red-green color-blind males there will be in this population.

***Figure 72.*** _UVR incidence vs daltonism rate._


***Figure 73.*** _Map of UVR incidence on daltonism rate. A low daltonism rate is a rate below the median of all daltonism rates (3.3), whereas a high daltonism rate codes for rates above 3.3._
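The high/low coding used for this map is a simple median split. A sketch with hypothetical rates (the actual median over the 142 populations is 3.3):

```python
from statistics import median

# Hypothetical daltonism rates; the median over the actual 142
# populations is 3.3 (% of colour-blind males).
rates = [0.4, 1.9, 3.3, 4.8, 8.0]
cut = median(rates)  # 3.3 for these illustrative values

labels = ["high" if r > cut else "low" for r in rates]
print(cut, labels)   # 3.3 ['low', 'low', 'low', 'high', 'high']
```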


Latitude. Latitude is highly correlated with UVR incidence, and its effect on daltonism is fully mediated by it: the closer to the equator a population is, the less likely it is to include red-green color-blind males.

***Figure 74.*** _Latitude on daltonism rate._


HG. With Bayes factors, HG slightly improves the fit.

However, one should note that only 14 hunter-gatherer groups are present in our dataset.

Hunter-gatherer males are less color-blind, even in places where the UVR incidence is low. We suggest that the selective pressure for color perception is higher for hunter-gatherers than for agriculture-based communities.

***Figure 75.*** _Subsistence strategy on daltonism rate._


Blue. Populations having a high red-green color blindness rate tend to have a specific term for blue in their vocabulary.

According to our hypothesis, Blue would be a proxy for green-blue perception, but there would be no causal relationship between Blue and daltonism.

***Figure 76.*** _Having a word for blue on daltonism rate._


Latitude of language family.

Its quadratic term is included in the set of relevant predictors.

***Figure 77.*** _Map of latitude of language family origins on daltonism rate._


***Figure 78.*** _Location of putative origins of language family on daltonism rate._


Variable importance with Random Forests

We used random forests to estimate the relative importance of our variables in predicting the daltonism rate. Variable importance was computed with Gini-based and permutation-based indices through the same process as before.

For more information, please see Random forests.

The Root Mean Squared Error (RMSE) for the randomForest function is 207.18 and the R-squared explained variance is 32.58.
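Permutation-based importance can be illustrated in a few lines: permute one predictor at a time and record how much the prediction error grows. The analysis itself used the R packages randomForest and party; the data and the fixed stand-in “model” below are hypothetical:

```python
import random

random.seed(1)
n = 500
uvb = [random.uniform(-2, 2) for _ in range(n)]
noise = [random.uniform(-2, 2) for _ in range(n)]
# Hypothetical outcome depending on uvb only:
y = [3.3 - 0.8 * u + random.gauss(0, 0.3) for u in uvb]

def mse(pred, obs):
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)

def model(u, v):
    # A fixed predictor standing in for a fitted forest; it ignores v.
    return 3.3 - 0.8 * u

base_error = mse([model(u, v) for u, v in zip(uvb, noise)], y)

def importance(values, other, first):
    """Increase in MSE after permuting one predictor column."""
    shuffled = values[:]
    random.shuffle(shuffled)
    pairs = zip(shuffled, other) if first else zip(other, shuffled)
    return mse([model(u, v) for u, v in pairs], y) - base_error

imp_uvb = importance(uvb, noise, first=True)
imp_noise = importance(noise, uvb, first=False)
print(imp_uvb > imp_noise)  # True: permuting uvb hurts, noise does not
```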

***Figure 79.*** _Variable importance according to randomForest algorithm using the Gini index._


***Figure 80.*** _Variable importance according to randomForest algorithm using the accuracy index._


The Root Mean Squared Error (RMSE) for the cforest function is 205.73 and the R-squared explained variance is 33.51.

***Figure 81.*** _Variable importance according to cforest algorithm using the accuracy index._


Thus, we find that UVR incidence is the most important predictor, but also that latitude and the presence of a blue word matter. Depending on the method used, climate PC1, climate PC2, longitude and subsistence strategy are also included in the set of relevant predictors.

These results are consistent with the results from the regression analysis, except for the inclusion here of climate variables; moreover, the latitude and longitude of language family origins do not appear to be important variables here.

Decision tree

We used simple decision trees as implemented by ctree (Hothorn et al., 2006).

***Figure 82.*** _Simple decision tree._


To understand what this decision tree does, we plotted which populations fall below/above -0.337 and -1.315 for z-scored UVR, below/above 8.275 for log_dist2oceans, and below/above 0.793 for the latitude of language family origin.

***Figure 83.*** _UVR (z-scored) as split by the decision tree._



***Figure 84.*** _Log of distance to oceans as split by the decision tree._


***Figure 85.*** _Latitude of language family origins as split by the decision tree._


Here, UVR provides the first and main split in the tree; as suggested by the regression analysis, Blue and subsistence mode also play important roles:

  • when UVR incidence is high, populations having a word for Blue tend to have higher frequencies of abnormal color perception than populations lacking a word for blue,
  • when UVR incidence is low and the population’s subsistence strategy is based on hunting, gathering and foraging, the frequency of abnormal color perception is likely to be low.

Other machine learning techniques

Here, we compare the performance of Bayesian mixed-effects models, the random forest algorithm and Support Vector Machines (SVM).

Daltonism is a continuous variable, so we use four measures of regression model accuracy:

  • MAE (Mean Absolute Error) is the average over the dataset of the absolute differences between observed and predicted values.
  • MSE (Mean Squared Error) is the average over the dataset of the squared differences between observed and predicted values.
  • RMSE (Root Mean Squared Error) is the square root of the MSE, expressed in the units of the outcome.
  • R-squared (coefficient of determination) measures how well the predicted values fit the observed ones. It varies from 0 to 1 and can be read as a percentage; the higher the value, the better the model.

Formulas
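The four measures can be written out compactly in code; a sketch with hypothetical observed and predicted daltonism rates:

```python
import math

def regression_metrics(obs, pred):
    """MAE, MSE, RMSE and R-squared, as defined above."""
    n = len(obs)
    resid = [o - p for o, p in zip(obs, pred)]
    mae = sum(abs(r) for r in resid) / n
    mse = sum(r * r for r in resid) / n
    rmse = math.sqrt(mse)
    mean_obs = sum(obs) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    r2 = 1 - (mse * n) / ss_tot
    return mae, mse, rmse, r2

# Hypothetical observed/predicted rates on the 0-10.68 daltonism scale:
obs = [0.5, 2.0, 3.3, 5.1, 8.0]
pred = [1.0, 2.5, 3.0, 4.6, 7.5]
mae, mse, rmse, r2 = regression_metrics(obs, pred)
print(round(mae, 2), round(rmse, 3), round(r2, 3))  # 0.46 0.467 0.968
```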

Model with all covariates:

MAE and RMSE:

***Figure 86.*** _Measures of model regression accuracy (MAE and RMSE), with the model including all covariates._


The unit in this graph is that of daltonism: as a reminder, daltonism varies between 0 and 10.68 (percentage of color-blind males in the population). In a good model, the values of MSE, RMSE and MAE tend to be close to 0.

However, R-squared is a percentage varying between 0 and 100, where a good model tends to be close to 100:

Rsquared:

***Figure 87.*** _Measures of model regression accuracy (R-squared), with the model including all covariates._


Model with only UVR, presence of a blue word, quadratic effect of latitude of putative language family origin:

MAE and RMSE:

***Figure 88.*** _Measures of model regression accuracy (MAE and RMSE), with the model including only UVR, Blue, latitude² of language family origins._


***Figure 89.*** _Measures of model regression accuracy (R-squared), with the model including only UVR, Blue, latitude² of language family origins._


The performance is very similar for all algorithms, and the explained variance is quite low for the Bayesian mixed-effects model, the random forest and the SVM alike. Thus, a simple Bayesian mixed-effects model is enough to capture the explanatory power of our variables for predicting daltonism.

However, our variables do not account for the full variance of the daltonism rate, and the error rate is still quite high.

Note: in previous parts, when fitting the Bayesian mixed-effects models and random forests, we obtained R-squared estimates varying between 30 and 45%. However, those estimates were computed on the full dataset, not on a held-out test set only.

Conclusions

Thus:

  • UVR incidence is strongly negatively related to the population frequency of red-green abnormal color perception, even after controlling for various ecological and cultural confounds;
  • subsistence strategy is also significantly associated, with hunter-gatherers being less likely to have red-green abnormal color perception;
  • having a dedicated word for blue is also positively related, and we suggest that this is in fact a proxy for high levels of acquired abnormal blue-green color perception (as per the 1st hypothesis);
  • the influence of latitude is fully mediated by UVR incidence;
  • depending on the particular method used, climate and the putative geographic origin of language family may be relevant predictors.


Discussion and conclusions

In this study, I investigated whether differences in eye physiology across populations may help explain differences in the colour lexicon and in the frequency of colour blindness. By performing Point Pattern Analysis, mediation analysis, mixed-effects Bayesian regressions, decision trees/random forests and Support Vector Machines on a set of 142 populations, I explored the relationship between UV incidence and the presence of a specific term for blue, on the one hand, and between UV incidence and the frequency of red-green abnormal colour perception in males, on the other hand.

1. Point Pattern Analysis

Point pattern analysis operates in a specific conceptual framework where points are events with an expected probability of occurrence at any location. However, the location of populations is constrained by many factors. The analysis of spatial randomness and dependence, together with the covariate analysis, highlighted the obvious constraints on the distribution of languages on Earth: humans avoid inhospitable places and live on land (as opposed to water). Our dataset is quite representative of the world's populations, although our sample is very rich in Europe and India, whereas other regions like Australia and Papua New Guinea are not as well covered.

We also found that our point attributes (i.e., Blue and daltonism) are not randomly dispersed across populations. Indeed, both the spatial autocorrelation and the neighbouring coefficient analyses indicate that similar attribute values tend to be located in the same areas.

Furthermore, genetic distance (computed with ultrametric data imputation) yields the highest autocorrelation for colour blindness and the presence of a specific blue term. As inherited red-green colour blindness is due to variation at the level of the DNA sequence (Stelzer et al., 2016), it is tempting to suggest that genetic distance is consistent with the physiological explanation of colour blindness. Indeed, we assume that overall genetic distances reflect population demographic history, which encompasses the history of the colour-blindness-related DNA sequences.

Yet, the association between genetic distance and the distribution of a word for blue is not straightforward to interpret. As Komarova, Jameson, & Narens (n.d.) suggest in their evolutionary models, colour blindness potentially affects the colour lexicon. Lindsey & Brown (2002) also claimed that abnormal colour perception in the green-blue range would change colour vocabulary, and especially the presence of a specific term for blue. One could also argue that populations separated by a short genetic distance share a common history and culture, which would explain similarities in their colour lexicons.

2. Hypothesis 1

Introduction:

There is a lot of evidence supporting the universalist perspective (Heider & Olivier (1972); Berlin & Kay (1991); Regier & Kay (2009)), which proposes that the development of the colour lexicon is very similar across all languages and human groups, due to shared commonalities in eye physiology and cognition. In this thesis, I do not directly challenge this finding, but try to understand how patterns of variation between languages in the presence of a specific term for “blue” could emerge. Contrary to previous universalist work, which assumes that the physiological basis of vision in the eye is the same for all humans, I do not treat eye physiology as a factor driving similarities between the world's colour vocabularies. Instead, I assume that differences in eye physiology could be responsible for changes in colour perception; in doing so, I treat eye physiology as a factor driving variation across languages.

Summary of our results:

As suggested by Lindsey & Brown (2002), I found that UV incidence is a predictor of Blue. However, other variables in our dataset also predict the presence of a specific term for blue, such as the number of colour categories, which has a strong positive effect. Other cultural (population size and subsistence mode), environmental (distance to lakes) and geographical variables (latitude and longitude of putative language family origins) are also included in the set of relevant predictors. Finally, the mediation analysis showed that the effect of latitude on Blue is fully mediated by UV incidence.

Explanation of our results according to the relativist perspective, and relative contribution of different factors to the construction of colour categories:

The relativist perspective, as presented by the work of Davidoff, Davies, & Roberson (1999) and Roberson, Davidoff, Davies, & Shapiro (2004), has highlighted the idea that different environments and cultures may be the source of differences between colour lexicons. In this thesis, the addition of other (1) cultural, (2) environmental and (3) genetic variables to the model puts forward their relative contribution to understanding the presence of a specific word for blue.

  1. Cultural. Berlin & Kay (1991) proposed that the development of dyeing techniques would lead societies to have a greater need for additional colour terms. Also, several studies suggest a significant association between population size and cultural complexity (Shennan (2001); Henrich (2004); Powell, Shennan, & Thomas (2009); Premo & Kuhn (2010); Derex, Beugin, Godelle, & Raymond (2013)). Following these two assumptions, both our variables number of colour categories and population size would be indicators of cultural complexity in our models. Consequently, the important role of the number of colour categories and of population size in our results brings forward cultural complexity as one of the main contributors to differences between colour lexicons in what concerns the presence of a specific word for blue, consistent with Ember (1978)’s findings. Gibson et al. (2017) also emphasized the role of culture in the evolution of the colour lexicon. They suggest that « salient » objects – to which humans need to refer – are usually characterized by warm colours (red, yellow…), whereas « uninformative » backgrounds are usually cool-coloured (for example blue, green and brown), leading to the under-representation of cool colours in colour lexicons. Because industrialization generates objects distinguishable solely by colour, they postulate that industrialized cultures would have a greater need for additional colour categories, resulting in the appearance of extra categories for « blue », « brown » and « grey ».

  2. Environmental. Likewise, the distance to lakes would contribute to the presence of a specific term for blue; however, we have no explanation for the reasons underlying this finding. According to Goldstone (1998), what is available in the perceptual input affects the development of a particular perceptual skill: thus, it is likely that our physical surroundings play a role in colour lexicon formation. However, the lack of predictive strength of the climate variables (which encode whether the populations live in tropical, cool, desert or oceanic climates) and of the other distance-to-water variables suggests that the effect of our physical surroundings on colour lexicon development is low. It is also possible that we did not include the right measures but, as our variables capture a lot of information, this is not very likely.

  3. Genetic. The genetic distance variables included in the model, which give a partial account of the history of human migrations, did not contribute to the presence of a specific word for blue. However, the multi-dimensional scaling applied to the genetic distance matrices did not result in a high goodness-of-fit, and we believe the variables used did not fully reflect genetic distances. Also, we know that the history of migrations and contact contributed significantly to colour vocabulary. For example, it is believed that the Japanese started to separate blue and green colours when educational materials were imported from the Allies after WW2, and after coloured pencils were imported to Japan (Bhatia, 2018).

Those findings are fully compatible with our first hypothesis, namely that UV incidence drives changes in the colour lexicon. Indeed, we assume that the observed variation in the colour lexicon of the world’s languages does not depend on a single underlying factor. In this thesis, I find that UV incidence is a significant predictor of the presence of a specific blue term, even after controlling for various ecological and cultural confounds. Thus, even though UV incidence is not the main factor impacting the presence or absence of a specific term, this analysis supports the idea that UV incidence makes its own contribution to the structure of the colour vocabulary, among other predictors.

Causal pattern:

The pattern found suggests that there may be a causal relationship going from UV incidence to Blue, the inverse causal relationship being very unlikely. The chain of causal events in Illustration 6 proposed by Lindsey & Brown (2002) is a possible explanation for this relationship, even if I do not exclude the possibility that UV incidence could affect the presence of a specific blue term through other causal pathways. As an example, one can postulate that UV incidence partially impacts food production systems, thus affecting economic development; however, I believe that our climate variables would better reflect this sort of causal assumption. Also, UV incidence and Blue could be proxies for other variables, but this is probably unlikely as we controlled for several cultural and environmental variables. Thus, the causal assumptions linking UV incidence to macular pigment density, and macular pigment density to abnormal colour perception in the green-blue range, remain, in our view, the most probable explanation for the potential effect of UV incidence on colour naming.

Role of physiology:

Therefore, some physical variables – such as UV incidence – could have an indirect effect on semantic categories through their effects on physiology. Geographical location, which partly determines the amount of UV radiation reaching the eye, would also indirectly contribute to colour naming through its mediation relationship with UV incidence and, consequently, physiology.

3. Hypothesis 2

Summary of results:

Using the same methods, we found that UV incidence is strongly negatively related to the population frequency of red-green abnormal colour perception, even after controlling for various ecological and cultural confounds. UV incidence is the strongest predictor according to both random forests and Bayesian models (if latitude is excluded). Indeed, the effect of latitude on daltonism is fully mediated by UV incidence.

Causal pattern:

The strength of UV incidence as a predictive factor for daltonism may suggest, at face value, that UV incidence plays a role in determining the male red-green colour blindness rate. If UV incidence and daltonism are not proxies for other unmeasured variables, then a possible causal relationship may go from UV incidence to red-green colour blindness: UV radiation can cause DNA mutations (Pfeifer, You, & Besaratinia (2005)), with skin cancer being an example, but the most common colour blindness conditions are inherited, and there is no evidence that UV incidence could cause DNA mutations in gametes. Moreover, the red-green colour blindness mutation is present only in places where the UV incidence rate is low. Therefore, UV incidence cannot be a direct explanation for mutations in the opsin genes. The chain of causality proposed by Brown & Lindsey (2004) and summarized in Illustration 6 then becomes a much more probable explanation for the strong impact of UV incidence on daltonism. However, I do not set aside the possibility that UV incidence and daltonism are linked through other causal pathways.

Together, these results put forward the role of physical factors in shaping the frequency of genetic variants by generating different selective pressures in different populations. Like colour vocabulary, colour blindness can be linked to geographical location through its mediating relationship with UV incidence and eye physiology (see Illustration 6).

Relative contribution of other variables:

Subsistence strategy. Other variables also play a significant role in predicting daltonism. Corroborating our hypothesis, subsistence strategy is slightly associated with it, with hunter-gatherers being less likely to have red-green abnormal colour perception. We suggest that the selective pressure on hunter-gatherers is higher than in populations relying on food production. Interestingly, hunter-gatherers in our dataset have a fairly low colour-blindness rate irrespective of UV incidence, meaning that a selective pressure on colour perception may indeed exist. However, even though we attempted to have a representative sample of the world’s languages, there were few hunter-gatherer communities, making it difficult to draw conclusions about their potential effect in our analysis.

Blue. Having a dedicated word for blue is significantly positively related to the frequency of red-green abnormal colour perception: populations having a high red-green colour blindness rate tend to have a specific term for blue in their vocabulary. We suggest that this is, in fact, a proxy for high levels of acquired abnormal blue-green colour perception (see 1st hypothesis and Illustration 9). One can also suggest other explanations not linked to our hypothesis: for example, it is possible that red-green colour blindness has a direct impact on colour categorisation.

4. Limitations

However, our analysis has several shortcomings.

  1. Genetic distance matrices. First, the genetic distance matrices computed from ALFRED were based on only 478 microhaplotypes and SNPs and, therefore, are not representative of the whole human genome. However, the SNPs selected were extracted from a set of AISNPs (Ancestry Inference SNPs), which are loci highly informative for population history and migration. Also, the matching between populations in our dataset and populations in ALFRED is imperfect (see Annexe Genetic distance matrix, section Step 4: Link populations from FST table to our dataset’s population, for more information). Finally, there is no genetic frequency data concerning the opsin genes, which are responsible for colour perception.

  2. Location of boundaries in colour categorisation. Second, our current database is not sufficient to test the claims put forward in computational studies, namely that abnormal colour perception can lead to a redrawing of colour-category boundaries, as the exact location of category boundaries is not recorded in our dataset.

  3. Index of level of exposure to dyeing techniques. Third, this study lacks an index of populations' exposure to manufactured goods and dyeing techniques. We expected some of these distinctions to be partly reflected in subsistence mode, with hunter-gatherer populations having less access to dyed products, and in population size, with populations larger than a few hundred thousand people being more likely to have access to dyeing techniques. Country-level development indexes were not relevant here, as we mainly studied specific populations within countries. More generally, development indexes for specific populations or isolated communities are largely outdated and ethically contentious. A measure of exposure to dyeing technology specifically was too narrow to be found in the anthropological databases we consulted.

  4. Unique location of each population. Fourth, the different measures of distance to water (distance to rivers, oceans and lakes), as well as UV incidence, could only be computed from a single fixed latitude and longitude per population. However, populations of more than a few hundred people can occupy a large territory. We therefore regard these variables as rough, order-of-magnitude estimates of how far a population lives from lakes, rivers or oceans, not as precise indications.

  5. Colour blindness test. Finally, our measure of daltonism may be imperfect in some respects. Even though the tests used are standardised, the measure is subject to variation across experimenters, contexts and locations. However, among the numerous populations for which colour-blindness data were available, a meticulous and rigorous selection was made (Meeussen, 2015) to retain only those where the data were of sufficient quality.

6. Conclusion

We brought some evidence that real-world data are consistent with the hypothesis that both red-green colour blindness and the presence of a specific term for blue could be partly determined by UV incidence. Colour vocabulary is directly influenced by environmental and cultural factors, but may also be indirectly influenced through their mediating relationship with human physiology. Physiology, which has already been proposed as shaping the structure of languages' phoneme inventories (Moisik & Dediu, 2017; Dediu, Janssen, & Moisik, 2019; Blasi et al., 2019), is thus likely to also affect languages' semantic categories. Likewise, physiology would mediate between UV incidence and inherited red-green abnormal colour perception.

Together, our findings suggest that individual-level effects (here, an acquired defect in the blue-green colour range) can reshape both colour lexicons and inherited red-green abnormal colour perception. Understanding to what extent trichromacy is positively selected depending on subsistence strategy would add further context to our findings on the worldwide distribution of red-green colour blindness.

It would also be interesting to investigate to what extent other individual-level effects, such as the presence of colour-blind individuals in a population, could influence the development of colour categories. Specifically, it would be worth studying more closely the evolution of the colour lexicon on Pingelap island. Rare cases of tetrachromatic individuals (Deeb, 2005), should they spread in a population, would likewise be a relevant setting in which to investigate changes in colour lexicons.

In cross-linguistic work, Malt & Majid (2013) provide evidence for great diversity in how languages label categories. Both universalist and relativist factors can explain the pattern of semantic categories seen across the world, but studying their effects as mediated by physiology provides relevant leads for further exploration.

Appendix

Distance matrices

Introduction

Definitions and explanations of terms:

  • The fixation index (FST) is an index measuring genetic distance between populations. It quantifies population differentiation by comparing allele frequencies between populations in pairwise comparisons. High FST values indicate strong population differentiation, whereas low values indicate substantial interbreeding (gene flow) between populations.

  • The ALFRED database (Rajeevan, 2003) is an open compilation of allele frequencies computed for SNPs and microhaplotypes in a set of human populations across the world.
    • In genetics, a microhaplotype is any haplotype that has very few alleles, a haplotype being a set of genetic determinants located on a single chromosome.
    • A SNP (single-nucleotide polymorphism) is a common type of genetic variation among people, representing a substitution of a single nucleotide at a specific position in the genome. Variation depends on both individuals and populations, and leads to different genetic traits. For example, SNP variants leading to colour-blind male individuals have been identified in the GeneCards database (Stelzer et al., 2016).
  • Arlequin (Excoffier & Lischer, 2010) is a software package for population genetics data analysis, which we used to compute FST from the ALFRED data. Its goal is to provide users with a large set of basic methods and statistical tests, in order to extract information on genetic and demographic features of a collection of population samples.
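
The FST definition above can be illustrated with a toy calculation. This is a sketch in Python of Wright's FST for a single biallelic SNP in two populations, not the Arlequin pipeline actually used in this study; the allele frequencies are invented for the example:

```python
# Toy two-population FST for one biallelic SNP: Wright's FST = (HT - HS) / HT,
# where HT is the expected heterozygosity of the pooled population and HS the
# mean within-population expected heterozygosity.

def fst_biallelic(p1: float, p2: float) -> float:
    """FST from the frequency of one allele in two equally weighted populations."""
    p_bar = (p1 + p2) / 2                                # pooled allele frequency
    h_t = 2 * p_bar * (1 - p_bar)                        # total expected heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-population value
    return (h_t - h_s) / h_t

# Identical frequencies -> no differentiation:
print(fst_biallelic(0.5, 0.5))              # 0.0
# Strongly diverged frequencies -> high differentiation:
print(round(fst_biallelic(0.1, 0.9), 3))    # 0.64
```

High gene flow pulls the two frequencies together and drives FST towards 0; isolation lets them drift apart and drives FST towards 1, matching the verbal definition above.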

Method

Step 1: gather data from Alfred.

The following files were extracted from the ALFRED database and FROG-kb (Rajeevan, 2003):

Panel Allele frequency file Population samples SNPs Microhaps
Microhaplotypes Microhap_alleleF_198.txt 96 0 198
Seldin’s list of 128 AISNPs Seldin128_alleleF.txt 70 128 0
SNPforID 34-plex SNPForId34_alleleF.txt 53 34 0
KiddLab - Set of 55 AISNPs KiddLab55_alleleF.txt 139 55 0
Kayser’s set of 24 Ancestry Informative Markers Kayser24_alleleF.txt 73 24 0
Daniele Podini’s list of 32 AISNPs Podini32_alleleF.txt 111 29 0
Eurasiaplex 23 SNP Panel Eurasiaplex23_alleleF.txt 76 23 0
Nievergelt’s Set of 41 AIMs Nievergelt41_alleleF.txt 123 41 0
Overlap set of AISNPs Overlap44_alleleF.txt 72 44 0
Li’s panel of 74 AIMS Li74_alleleF.txt 67 73 0


Step 2: compute FST for each SNP/micro-haplotype.

The different text files were gathered into one single text file and transformed into Arlequin project files (.arp) with an R script (see github mathjoss/Alfred2FST/alfredtxt2arlequin.R). This script creates one .arp file for each microhaplotype and each SNP.

These .arp files were then fed to Arlequin (Excoffier & Lischer, 2010) (number of permutations = 110). All Arlequin parameters used for the analysis can be found on github: mathjoss/Alfred2FST/arl_run.ars. Arlequin outputs one folder per input file.

We focused on the table called ‘Population average pairwise difference’, which contains the average number of pairwise differences between populations above the diagonal, within-population differences on the diagonal, and the corrected average pairwise differences below the diagonal.
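
A sketch (Python/numpy, with invented 3×3 values; the actual pipeline used Arlequin's output directly) of how the three parts of such a table can be read off a square matrix:

```python
import numpy as np

# Illustrative 3x3 'population average pairwise difference' table:
#   above the diagonal: average pairwise differences between populations,
#   on the diagonal:    within-population differences,
#   below the diagonal: corrected (net) between-population differences.
M = np.array([[1.0, 4.0, 5.0],
              [2.5, 2.0, 6.0],
              [3.0, 3.5, 3.0]])

within = np.diag(M)                      # within-population values
between = M[np.triu_indices(3, k=1)]     # values above the diagonal
corrected = M[np.tril_indices(3, k=-1)]  # values below the diagonal

# The corrected distance removes within-population diversity:
#   d_corrected(i, j) = d(i, j) - (d(i, i) + d(j, j)) / 2
d01_corr = M[0, 1] - (M[0, 0] + M[1, 1]) / 2
print(d01_corr)  # 2.5 with these toy values, matching M[1, 0]
```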


Step 3: Gather and mean all files.

These tables were averaged into one single table with two further R scripts (see github mathjoss/Alfred2FST/mean_all_files.R and mathjoss/Alfred2FST/mean_all_files_part2.R). If a population was absent from a given microhaplotype or SNP panel, the mean was computed over the available (non-missing) values.
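
This element-wise, NA-aware averaging can be sketched as follows (Python/numpy, with invented toy matrices standing in for the per-marker tables; the actual analysis used the R scripts cited above):

```python
import numpy as np

# Two toy per-marker distance tables over the same populations; NaN marks a
# population pair absent from that marker's panel.
m1 = np.array([[0.0, 1.0],
               [1.0, 0.0]])
m2 = np.array([[0.0, np.nan],
               [np.nan, 0.0]])

# Element-wise mean over markers, ignoring missing values:
stacked = np.stack([m1, m2])
mean_matrix = np.nanmean(stacked, axis=0)
print(mean_matrix[0, 1])  # 1.0 -- computed from the only available value
```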

This yielded a population average pairwise difference distance matrix based on:

  • 145 unique populations
  • 96 unique micro-haplotypes
  • 382 unique SNPs

Then, additive and ultrametric methods (De Soete, 1984; Lapointe & Kirsch, 1995) were used to impute the missing values in the final distance matrix. For details of the exact algorithm, see «A weighted least-squares approach for inferring phylogenies from incomplete distance matrices» (Makarenkov, 2004). The R code for this process is available on github: mathjoss/Alfred2FST/extract_MDS.R. This file was also used to perform multidimensional scaling (cmdscale).
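
R's cmdscale implements classical (Torgerson) MDS. A minimal Python/numpy sketch of the same procedure, applied to an invented 3-population distance matrix, looks like this:

```python
import numpy as np

def classical_mds(D: np.ndarray, k: int = 2) -> np.ndarray:
    """Classical (Torgerson) MDS, the method behind R's cmdscale."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]    # keep the k largest
    scale = np.sqrt(np.clip(vals[order], 0, None))
    return vecs[:, order] * scale         # coordinates in k dimensions

# Toy symmetric distance matrix for 3 'populations' (an equilateral triangle):
D = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
X = classical_mds(D, k=2)
# The recovered 2-D coordinates reproduce the input distances:
print(round(float(np.linalg.norm(X[0] - X[1])), 6))  # 1.0
```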


References

Baddeley, A., Rubak, E., & Turner, R. (2015). Spatial point patterns: Methodology and applications with R. Retrieved from http://www.crcpress.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/9781482210200/

Bentz, C., Dediu, D., Verkerk, A., & Jäger, G. (2018). The evolution of language families is shaped by the environment beyond neutral drift. Nature Human Behaviour, 2(11), 816. https://doi.org/10.1038/s41562-018-0457-6

Berlin, B., & Kay, P. (1991). Basic Color Terms: Their Universality and Evolution. University of California Press.

Bhatia, A. (2018). The crayola-fication of the world: How we gave colour names, and it messed with our brain (part I). Retrieved from http://www.empiricalzeal.com/2012/06/05/the-crayola-fication-of-the-world-how-we-gave-colors-names-and-it-messed-with-our-brains-part-i/

Bickel, B., Nichols, J., Zakharko, T., Witzlack-Makarevich, A., Hildebrandt, K., Rießler, M., … Lowe, J. B. (2017). The AUTOTYP typological databases (Version 0.1.0) [GitHub database]. Retrieved from https://github.com/autotyp/autotyp-data/tree/0.1.0

Blasi, D. E., Moran, S., Moisik, S. R., Widmer, P., Dediu, D., & Bickel, B. (2019). Human sound systems are shaped by post-Neolithic changes in bite configuration. Science, 363(6432), eaav3218. https://doi.org/10.1126/science.aav3218

Brown, A. M., & Lindsey, D. T. (2004). Color and language: Worldwide distribution of Daltonism and distinct words for "blue". Visual Neuroscience, 21(3), 409–412. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/15518222

Bürkner, P.-C. (2018a). Advanced Bayesian Multilevel Modeling with the R Package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017

Bürkner, P.-C. (2018b). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017

Cysouw, M., Dediu, D., & Moran, S. (2012). Comment on “Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa”. Science, 335(6069), 657. https://doi.org/10.1126/science.1208841

Davidoff, J., Davies, I., & Roberson, D. (1999). Colour categories in a stone-age tribe. Nature, 398(6724), 203–204. https://doi.org/10.1038/18335

Dediu, D., Janssen, R., & Moisik, S. R. (2019). Weak biases emerging from vocal tract anatomy shape the repeated transmission of vowels. Nature Human Behaviour, 3(10), 1107–1115. https://doi.org/10.1038/s41562-019-0663-x

Deeb, S. (2005). The molecular basis of variation in human color vision: Variation in human color vision. Clinical Genetics, 67(5), 369–377. https://doi.org/10.1111/j.1399-0004.2004.00343.x

Derex, M., Beugin, M.-P., Godelle, B., & Raymond, M. (2013). Experimental evidence for the influence of group size on cultural complexity. Nature, 503(7476), 389–391. https://doi.org/10.1038/nature12774

De Soete, G. (1984). Ultrametric tree representations of incomplete dissimilarity data. Journal of Classification, 1(1), 235–242. https://doi.org/10.1007/BF01890124

Ember, M. (1978). Size of Color Lexicon: Interaction of Cultural and Biological Factors. American Anthropologist, 80(2), 364–367. https://doi.org/10.1525/aa.1978.80.2.02a00100

Everett, C. (2013). Evidence for Direct Geographic Influences on Linguistic Sounds: The Case of Ejectives. PLoS ONE, 8(6), e65275. https://doi.org/10.1371/journal.pone.0065275

Excoffier, L., & Lischer, H. E. L. (2010). Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10(3), 564–567. https://doi.org/10.1111/j.1755-0998.2010.02847.x

Gibson, E., Futrell, R., Jara-Ettinger, J., Mahowald, K., Bergen, L., Ratnasingam, S., … Conway, B. R. (2017). Color naming across languages reflects color use. Proceedings of the National Academy of Sciences, 114(40), 10785–10790. https://doi.org/10.1073/pnas.1619666114

Goldstone, R. L. (1998). Perceptual learning. Annual Review of Psychology, 49(1), 585–612. https://doi.org/10.1146/annurev.psych.49.1.585

Hammarström, H., Bank, S., Forkel, R., & Haspelmath, M. (2018). Glottolog 3.2. Retrieved from http://glottolog.org

Hardy, L., Rand, G., & Rittler, M. (1954). HRR polychromatic plates. JOSA, 44(7), 509–521.

Heider, E. R., & Olivier, D. C. (1972). The structure of the color space in naming and memory for two languages. Cognitive Psychology, 3(2), 337–354. https://doi.org/10.1016/0010-0285(72)90011-4

Henrich, J. (2004). Demography and Cultural Evolution: How Adaptive Cultural Processes Can Produce Maladaptive Losses—The Tasmanian Case. American Antiquity, 69(2), 197–214. https://doi.org/10.2307/4128416

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674. https://doi.org/10.1198/106186006X133933

Ishihara, S. (1917). Tests for color-blindness. Tokyo: Hongo Harukicho.

Kirby, K. R., Gray, R. D., Greenhill, S. J., Jordan, F. M., Gomes-Ng, S., Bibiko, H.-J., … Gavin, M. C. (2016). D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity. PLOS ONE, 11(7), e0158391. https://doi.org/10.1371/journal.pone.0158391

Knoll, H. A. (1968). Patent No. US3382025A. Retrieved from https://patents.google.com/patent/US3382025A/en

Komarova, N. L., Jameson, K. A., & Narens, L. (n.d.). Evolutionary Models of Color Categorization Based on Discrimination. 46.

Lapointe, F. J., & Kirsch, J. A. W. (1995). Estimating phylogenies from lacunose distance matrices, with special reference to DNA hybridization data. Molecular Biology and Evolution, 12(2), 266–266. https://doi.org/10.1093/oxfordjournals.molbev.a040209

Lindsey, D. T., & Brown, A. M. (2002). Color naming and the phototoxic effects of sunlight on the eye. Psychological Science, 13(6), 506–512.

Malt, B. C., & Majid, A. (2013). How thought is mapped into words: How thought is mapped into words. Wiley Interdisciplinary Reviews: Cognitive Science, 4(6), 583–597. https://doi.org/10.1002/wcs.1251

Meeussen, E. (2015). Colour blindness and its contribution to colour vocabulary (Master’s thesis). Radboud Universiteit Nijmegen, Nijmegen, The Netherlands.

Moisik, S. R., & Dediu, D. (2017). Anatomical biasing and clicks: Evidence from biomechanical modeling. Journal of Language Evolution, 2(1), 37–51. https://doi.org/10.1093/jole/lzx004

Moran, P. A. P. (1950). Notes on Continuous Stochastic Phenomena. Biometrika, 37(1/2), 17–23. https://doi.org/10.2307/2332142

Pfeifer, G. P., You, Y.-H., & Besaratinia, A. (2005). Mutations induced by ultraviolet light. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 571(1-2), 19–31. https://doi.org/10.1016/j.mrfmmm.2004.06.057

Powell, A., Shennan, S., & Thomas, M. G. (2009). Late Pleistocene Demography and the Appearance of Modern Human Behavior. Science, 324(5932), 1298–1301. https://doi.org/10.1126/science.1170165

Premo, L. S., & Kuhn, S. L. (2010). Modeling Effects of Local Extinctions on Culture Change and Diversity in the Paleolithic. PLoS ONE, 5(12), e15582. https://doi.org/10.1371/journal.pone.0015582

Rajeevan, H. (2003). ALFRED: The ALelle FREquency Database. Update. Nucleic Acids Research, 31(1), 270–271. https://doi.org/10.1093/nar/gkg043

R Core Team. (2020). R: A language and environment for statistical computing. Retrieved from https://www.R-project.org/

Regier, T., & Kay, P. (2009). Language, thought, and color: Whorf was half right. Trends in Cognitive Sciences, 13(10), 439–446. https://doi.org/10.1016/j.tics.2009.07.001

Roberson, D., Davidoff, J., Davies, I. R. L., & Shapiro, L. R. (2004). The Development of Color Categories in Two Languages: A Longitudinal Study. Journal of Experimental Psychology: General, 133(4), 554–571. https://doi.org/10.1037/0096-3445.133.4.554

Shennan, S. (2001). Demography and Cultural Innovation: A Model and its Implications for the Emergence of Modern Human Culture. Cambridge Archaeological Journal, 11(1), 5–16. https://doi.org/10.1017/S0959774301000014

Spielman, S. E. (2017). Point Pattern Analysis. In International Encyclopedia of Geography (pp. 1–9). https://doi.org/10.1002/9781118786352.wbieg0849

Stelzer, G., Rosen, N., Plaschkes, I., Zimmerman, S., Twik, M., Fishilevich, S., … Lancet, D. (2016). The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Current Protocols in Bioinformatics, 54(1). https://doi.org/10.1002/cpbi.5

Thomson, W. (1880). An instrument for the detection of color blindness. Transactions of the American Ophthalmological Society, 3, 142.

Tingley, D., Yamamoto, T., Hirose, K., Keele, L., & Imai, K. (2014). Mediation: R Package for Causal Mediation Analysis. Journal of Statistical Software, 59(1), 1–38. https://doi.org/10.18637/jss.v059.i05

Turchin, P., Brennan, R., Currie, T., Feeney, K., Francois, P., Hoyer, D., … Whitehouse, H. (2015). Seshat: The Global History Databank. Cliodynamics, 6(1). https://doi.org/10.21237/C7clio6127917

Wichmann, S., Müller, A., & Velupillai, V. (2010). Homelands of the world’s language families: A quantitative approach. Diachronica, 27(2), 247–276. https://doi.org/10.1075/dia.27.2.05wic